
Issue merging across whitespaces #1475

Closed
henrycharlesworth opened this issue Mar 21, 2024 · 2 comments

@henrycharlesworth

My use case is that I have a corpus of instructions, e.g. taking a small snapshot:

push rbp <eoi>
push rbx <eoi>
mov rbx , rdi <eoi>
sub rsp , 0x10 <eoi>
mov rdi , qword PTR <STACK_ADDR> <eoi>

Each line is a separate "instruction", and I want to allow merges across the whole instruction (including whitespaces). My first attempt at this is:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.normalizers import Replace, Sequence as NormalizerSequence
from tokenizers.pre_tokenizers import Split, Sequence as PreTokenizerSequence
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE())
tokenizer.normalizer = NormalizerSequence([Replace(",", "")])  # drop the commas between operands
tokenizer.pre_tokenizer = PreTokenizerSequence([Split("\n", behavior="removed")])  # only split on newlines, so merges can span the spaces inside an instruction

trainer = BpeTrainer(vocab_size=10000, min_frequency=2, special_tokens=["<s>", "<pad>", "</s>", "<unk>"])
tokenizer.train(files=["experimenting/tokenizer_tmp/full_corpus_merged_split_0.txt"], trainer=trainer)

This basically seems to work, although there are a few issues. One is that it's hard to interpret the merge list in the JSON file when I save it, which I suspect is related to the main issue: I can't load the tokenizer after training it.
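(For reference, the save step is just the standard Tokenizer.save call, producing the file I then try to load below:)

tokenizer.save("tokenizer.json")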

When I run:

from transformers import PreTrainedTokenizerFast
loaded_tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")

I get an error:
Exception: data did not match any variant of untagged enum ModelWrapper at line 11182 column 3

It seems related to #566, which led me to #909. That PR hasn't been merged, and it suggests that merging across spaces probably isn't supported yet without it. But things largely work right up until the point where I try to load the tokenizer, which is strange.

Does anyone have any suggestions for getting around this, or an alternative approach to do the same thing?


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label on Apr 21, 2024
@github-actions github-actions bot closed this as not planned on Apr 27, 2024
@wrongbad

I was running into a similar issue. I also wanted to model the text whitespace/formatting, without "losing" it to the pre-tokenizer.

I think I found a workaround:

from tokenizers import Tokenizer, decoders, pre_tokenizers
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()
tokenizer.decoder = decoders.Metaspace()

The Metaspace class uses a rare Unicode character as a placeholder for spaces, so no literal spaces end up in the saved vocabulary/merges, which avoids the JSON loading issue.
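Roughly, end to end (the corpus path, vocab size, and sample string below are just placeholders):

from tokenizers import Tokenizer, decoders, pre_tokenizers
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()  # rewrites spaces to the placeholder character before splitting
tokenizer.decoder = decoders.Metaspace()              # maps the placeholder back to spaces on decode

trainer = BpeTrainer(vocab_size=10000, special_tokens=["<s>", "<pad>", "</s>", "<unk>"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)

tokenizer.save("tokenizer.json")
reloaded = Tokenizer.from_file("tokenizer.json")  # loads cleanly, since the saved merges contain no literal spaces

enc = reloaded.encode("mov rbx , rdi <eoi>")
print(enc.tokens)                # the placeholder is visible in the tokens, so the formatting isn't lost
print(reloaded.decode(enc.ids))  # the original spacing comes back through the Metaspace decoder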
