
Issue merging across whitespaces #1475

Closed
henrycharlesworth opened this issue Mar 21, 2024 · 2 comments

@henrycharlesworth

My use case is that I have a corpus of instructions, e.g. taking a small snapshot:

push rbp <eoi>
push rbx <eoi>
mov rbx , rdi <eoi>
sub rsp , 0x10 <eoi>
mov rdi , qword PTR <STACK_ADDR> <eoi>

Each line is a separate "instruction", and I want to allow merges across the whole instruction (including whitespaces). My first attempt at this is:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.normalizers import Replace, Sequence as NormalizerSequence
from tokenizers.pre_tokenizers import Split, Sequence as PreTokenizerSequence
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE())
tokenizer.normalizer = NormalizerSequence([Replace(",", "")])  # drop the commas between operands
tokenizer.pre_tokenizer = PreTokenizerSequence([Split("\n", behavior="removed")])  # only split on newlines, so merges can span the spaces inside an instruction

trainer = BpeTrainer(vocab_size=10000, min_frequency=2, special_tokens=["<s>", "<pad>", "</s>", "<unk>"])
tokenizer.train(files=["experimenting/tokenizer_tmp/full_corpus_merged_split_0.txt"], trainer=trainer)

This basically seems to work, although there are a few issues. One is that it's hard to interpret the merge list in the JSON file when I save it, which I suspect is related to the main issue: I can't load the tokenizer after training it.
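(For reference, the save step is just the standard Tokenizer.save call, producing the file I then try to load below:)

tokenizer.save("tokenizer.json")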

When I run:

from transformers import PreTrainedTokenizerFast
loaded_tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")

I get an error:
Exception: data did not match any variant of untagged enum ModelWrapper at line 11182 column 3

It seems related to #566, which led me to #909. That PR hasn't been merged, and it suggests that merging across spaces probably isn't supported yet without it. But things largely work right up until the point where I try to load the tokenizer, which is strange.

Does anyone have any suggestions for getting around this, or an alternative approach to do the same thing?


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label on Apr 21, 2024
@github-actions github-actions bot closed this as not planned on Apr 27, 2024
@wrongbad

I was running into a similar issue. I also wanted to model the text whitespace/formatting, without "losing" it to the pre-tokenizer.

I think I found a workaround:

from tokenizers import Tokenizer, decoders, pre_tokenizers
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()
tokenizer.decoder = decoders.Metaspace()

The Metaspace class uses a rare Unicode character as a placeholder for spaces, so no literal spaces end up in the saved vocabulary/merges, which avoids the JSON loading issue.
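Roughly, end to end (the corpus path, vocab size, and sample string below are just placeholders):

from tokenizers import Tokenizer, decoders, pre_tokenizers
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()  # rewrites spaces to the placeholder character before splitting
tokenizer.decoder = decoders.Metaspace()              # maps the placeholder back to spaces on decode

trainer = BpeTrainer(vocab_size=10000, special_tokens=["<s>", "<pad>", "</s>", "<unk>"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)

tokenizer.save("tokenizer.json")
reloaded = Tokenizer.from_file("tokenizer.json")  # loads cleanly, since the saved merges contain no literal spaces

enc = reloaded.encode("mov rbx , rdi <eoi>")
print(enc.tokens)                # the placeholder is visible in the tokens, so the formatting isn't lost
print(reloaded.decode(enc.ids))  # the original spacing comes back through the Metaspace decoder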
