
Detokenizer fixes #8039

Draft: jaime-m-p wants to merge 22 commits into master

Conversation

jaime-m-p (Collaborator)

This PR tries to solve the most common problems with detokenization (i.e., spaces after special tokens).

Related issues: #8023, #7938.
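
For readers unfamiliar with the failure mode, here is a toy C++ sketch (not llama.cpp's actual detokenizer; the vocab handling is deliberately simplified) of how force-inserting a space after special tokens breaks the text -> tokens -> text roundtrip:

    #include <iostream>
    #include <string>
    #include <vector>

    // Toy illustration only -- not llama.cpp's real code. SPM vocabs store
    // word-initial pieces with a U+2581 '▁' prefix that maps back to a space;
    // a detokenizer that *also* force-inserts a space after every special
    // token produces "<|end|> \n" where the reference produces "<|end|>\n",
    // so re-tokenizing the output yields different tokens.
    static std::string detokenize(const std::vector<std::string> & pieces,
                                  bool space_after_special) {
        std::string out;
        for (const auto & p : pieces) {
            const bool is_special = p.size() > 2 && p.front() == '<' && p.back() == '>';
            if (is_special) {
                out += p;
                if (space_after_special) {
                    out += ' ';  // the spurious space this PR removes
                }
            } else if (p.rfind("\xE2\x96\x81", 0) == 0) {  // piece starts with '▁'
                out += ' ';
                out += p.substr(3);  // skip the 3-byte UTF-8 encoding of '▁'
            } else {
                out += p;
            }
        }
        return out;
    }

    int main() {
        const std::vector<std::string> pieces = {"<|end|>", "\n", "<|assistant|>"};
        std::cout << "[" << detokenize(pieces, true)  << "]\n";  // [<|end|> \n<|assistant|> ]
        std::cout << "[" << detokenize(pieces, false) << "]\n";  // [<|end|>\n<|assistant|>]
    }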


@jaime-m-p added the "bugfix" (fixes an issue or bug) label on Jun 20, 2024
@jaime-m-p marked this pull request as draft on Jun 20, 2024, 16:58
@jaime-m-p (Collaborator, Author) commented Jun 20, 2024

Initial detokenizer state:

VOCABS \ TESTS: test-tokenizer-0 test-tokenizer-1-bpe test-tokenizer-1-spm
ggml-vocab-aquila.gguf ERROR OK -
ggml-vocab-baichuan.gguf ERROR - ERROR
ggml-vocab-bert-bge.gguf OK - -
ggml-vocab-command-r.gguf OK OK -
ggml-vocab-deepseek-coder.gguf ERROR OK -
ggml-vocab-deepseek-llm.gguf ERROR OK -
ggml-vocab-falcon.gguf OK OK -
ggml-vocab-gpt-2.gguf OK OK -
ggml-vocab-gpt-neox.gguf ERROR OK -
ggml-vocab-jina-v2-code.gguf OK OK -
ggml-vocab-jina-v2-de.gguf OK OK -
ggml-vocab-jina-v2-en.gguf OK - -
ggml-vocab-jina-v2-es.gguf OK OK -
ggml-vocab-llama-bpe.gguf OK OK -
ggml-vocab-llama-spm.gguf OK - ERROR
ggml-vocab-mpt.gguf ERROR OK -
ggml-vocab-olmo.gguf ERROR OK -
ggml-vocab-phi-3.gguf OK - ERROR
ggml-vocab-poro-chat.gguf OK OK -
ggml-vocab-refact.gguf OK OK -
ggml-vocab-smaug-bpe.gguf OK OK -
ggml-vocab-starcoder.gguf OK OK -

@github-actions bot added the "testing" (Everything test related) label on Jun 20, 2024
@jaime-m-p (Collaborator, Author) commented Jun 20, 2024

Real initial state:

VOCAB \ TESTS: test-tokenizer-0 test-tokenizer-1-bpe test-tokenizer-1-spm
ggml-vocab-aquila.gguf ERROR OK -
ggml-vocab-baichuan.gguf ERROR - ERROR
ggml-vocab-bert-bge.gguf OK - -
ggml-vocab-command-r.gguf OK OK -
ggml-vocab-deepseek-coder.gguf OK OK -
ggml-vocab-deepseek-llm.gguf OK OK -
ggml-vocab-falcon.gguf OK ERROR -
ggml-vocab-gpt-2.gguf OK OK -
ggml-vocab-gpt-neox.gguf ERROR OK -
ggml-vocab-jina-v2-code.gguf OK OK -
ggml-vocab-jina-v2-de.gguf OK OK -
ggml-vocab-jina-v2-en.gguf OK - -
ggml-vocab-jina-v2-es.gguf OK OK -
ggml-vocab-llama-bpe.gguf OK OK -
ggml-vocab-llama-spm.gguf OK - ERROR
ggml-vocab-mpt.gguf OK OK -
ggml-vocab-olmo.gguf OK OK -
ggml-vocab-phi-2.gguf OK OK -
ggml-vocab-phi-3.gguf OK - ERROR
ggml-vocab-poro-chat.gguf OK OK -
ggml-vocab-refact.gguf OK ERROR -
ggml-vocab-smaug-bpe.gguf OK OK -
ggml-vocab-stablelm2.gguf OK OK -
ggml-vocab-starcoder.gguf OK ERROR -

Add detokenizer checks
New generator: ascii_lr_strip
New generator: apostrophe
Add more vocabs files
@github-actions bot added the "python" (python script changes) label on Jun 20, 2024
@jaime-m-p (Collaborator, Author)

Brute-force encoding and decoding tests (number of errors; '*' means >= 10 errors):

VOCABS \ TESTS: added_lr_strip apostrophe ascii_lr_strip unicodes vocab_words
bert-bge * * * * *
deepseek-coder
deepseek-llm
falcon * *
gpt-2
jina-v2-code
jina-v2-de
jina-v2-en * * * * *
jina-v2-es
llama-bpe
llama-spm * * * * *
mpt *
olmo *
phi-2
phi-3 * * * * *
poro-chat
qwen2 * *
refact
smaug-bpe
stablelm2
starcoder
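
For context, these brute-force tests generate strings and compare llama.cpp against the HuggingFace AutoTokenizer reference, counting mismatches (as the comparisons later in this thread show). A minimal C++ sketch of that comparison loop, with both implementations abstracted as hypothetical callables -- the PR's actual harness is a Python script, and these names are invented for illustration:

    #include <functional>
    #include <string>
    #include <vector>

    // Hypothetical stand-ins for llama.cpp and the AutoTokenizer reference.
    struct Tokenizer {
        std::function<std::vector<int>(const std::string &)> encode;
        std::function<std::string(const std::vector<int> &)> decode;
    };

    // Returns the number of mismatches, capped at 10 as in the table above
    // ('*' marks >= 10 errors; an empty cell is 0).
    int count_errors(const std::vector<std::string> & texts,
                     const Tokenizer & impl, const Tokenizer & ref) {
        int errors = 0;
        for (const auto & text : texts) {
            const std::vector<int> ref_toks = ref.encode(text);
            if (impl.encode(text) != ref_toks ||                  // tokenization differs
                impl.decode(ref_toks) != ref.decode(ref_toks)) {  // detokenization differs
                if (++errors >= 10) {
                    break;
                }
            }
        }
        return errors;
    }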

@jaime-m-p (Collaborator, Author) commented Jun 20, 2024

Improvements:

VOCAB \ TESTS: test-tokenizer-0 test-tokenizer-1-bpe test-tokenizer-1-spm
ggml-vocab-aquila.gguf ERROR OK -
ggml-vocab-baichuan.gguf ERROR - OK
ggml-vocab-bert-bge.gguf OK - -
ggml-vocab-command-r.gguf OK OK -
ggml-vocab-deepseek-coder.gguf OK OK -
ggml-vocab-deepseek-llm.gguf OK OK -
ggml-vocab-falcon.gguf OK ERROR -
ggml-vocab-gpt-2.gguf OK OK -
ggml-vocab-gpt-neox.gguf ERROR OK -
ggml-vocab-jina-v2-code.gguf OK OK -
ggml-vocab-jina-v2-de.gguf OK OK -
ggml-vocab-jina-v2-en.gguf OK - -
ggml-vocab-jina-v2-es.gguf OK OK -
ggml-vocab-llama-bpe.gguf OK OK -
ggml-vocab-llama-spm.gguf OK - OK
ggml-vocab-mpt.gguf OK OK -
ggml-vocab-olmo.gguf OK OK -
ggml-vocab-phi-2.gguf OK OK -
ggml-vocab-phi-3.gguf OK - OK
ggml-vocab-poro-chat.gguf OK OK -
ggml-vocab-refact.gguf OK ERROR -
ggml-vocab-smaug-bpe.gguf OK OK -
ggml-vocab-stablelm.gguf OK OK -
ggml-vocab-starcoder.gguf OK ERROR -

@mofosyne added the "Review Complexity: Medium" (generally requires more time to grok, but manageable by beginner to medium expertise level) label on Jun 21, 2024
@cmp-nct (Contributor) commented Jun 23, 2024

I gave it a test run using phi3-mini-instruct.
First I ran a string through the tokenize_test.py that the model came with.
Both of these strings tokenize identically (the \n is not tokenized, which is surprising):
Test:<|user|>This is a test<|end|>\n<|assistant|>
Test:<|user|>This is a test<|end|><|assistant|>

ID: 1, Content: '<s>'
ID: 4321, Content: '▁Test'
ID: 29901, Content: ':'
ID: 32010, Content: '<|user|>'
ID: 910, Content: '▁This'
ID: 338, Content: '▁is'
ID: 263, Content: '▁a'
ID: 1243, Content: '▁test'
ID: 32007, Content: '<|end|>'
ID: 32001, Content: '<|assistant|>'

Now I ran the same through llama.cpp tokenization:

Tokens for 'Test:<|user|>This is a test<|end|>\n<|assistant|>':
  4321 -> ' Test'
 29901 -> ':'
 32010 -> '<|user|>'
   910 -> ' This'
   338 -> ' is'
   263 -> ' a'
  1243 -> ' test'
 32007 -> '<|end|>'
 29871 -> ' '
    13 -> '
'
 32001 -> '<|assistant|>'

Update:
It looks like in Python, \n characters are removed if they come after a special token (which is ironic, given the finetune templates).
The add-whitespace prefix probably works correctly in llama.cpp IF the newlines were trimmed first.
I don't know the logic behind this special handling.

I mimicked the Python tokenizer by adding this to llama.cpp:

    // remove potential newlines after special tokens
    if (vocab.tokenizer_add_space_prefix && is_prev_special) {
        while (raw_text.length() > 0 && raw_text[0] == '\n') {
            raw_text = raw_text.substr(1);
        }
        if (raw_text.length() == 0) continue;
    }

right before:

    // prefix with space if previous is special
    if (vocab.tokenizer_add_space_prefix && is_prev_special) {
        raw_text = " " + raw_text;
    }

Update 2:
I made another change, this time to prevent mutating single spaces into double spaces.
Python doesn't add a space if a space is already present.

    if (vocab.tokenizer_add_space_prefix && is_prev_special && raw_text.length() > 0 && !isspace(raw_text[0])) {
        raw_text = " " + raw_text;
    }

That would result in identical tokenization:

Tokens for 'Test:<|user|>This is a test<|end|><|assistant|>':
  4321 -> ' Test'
 29901 -> ':'
 32010 -> '<|user|>'
   910 -> ' This'
   338 -> ' is'
   263 -> ' a'
  1243 -> ' test'
 32007 -> '<|end|>'
 32001 -> '<|assistant|>'

jaime-m-p added 4 commits June 23, 2024 20:49
Useful when automating tests:
 - If you don't know in advance the vocab type.
 - Differentiate other loading errors.
Using exit() throws random exceptions
UNKNOWN and CONTROL are 'special pieces'.
Remove space after UNKNOWN and CONTROL.
Refactor llama_token_to_piece().
@jaime-m-p (Collaborator, Author)

The models baichuan, falcon and mpt have tokenization errors, so detokenization fails too.

VOCAB \ TESTS: test-tokenizer-0 test-tokenizer-1-bpe test-tokenizer-1-spm
ggml-vocab-aquila.gguf OK OK -
ggml-vocab-baichuan.gguf ERROR - OK
ggml-vocab-bert-bge.gguf OK - -
ggml-vocab-command-r.gguf OK OK -
ggml-vocab-deepseek-coder.gguf OK OK -
ggml-vocab-deepseek-llm.gguf OK OK -
ggml-vocab-falcon.gguf OK ERROR -
ggml-vocab-gpt-2.gguf OK OK -
ggml-vocab-gpt-neox.gguf OK OK -
ggml-vocab-jina-v2-code.gguf OK OK -
ggml-vocab-jina-v2-de.gguf OK OK -
ggml-vocab-jina-v2-en.gguf OK - -
ggml-vocab-jina-v2-es.gguf OK OK -
ggml-vocab-llama-bpe.gguf OK OK -
ggml-vocab-llama-spm.gguf OK - OK
ggml-vocab-mpt.gguf ERROR OK -
ggml-vocab-olmo.gguf OK OK -
ggml-vocab-phi-2.gguf OK OK -
ggml-vocab-phi-3.gguf OK - OK
ggml-vocab-poro-chat.gguf OK OK -
ggml-vocab-refact.gguf OK OK -
ggml-vocab-smaug-bpe.gguf OK OK -
ggml-vocab-starcoder.gguf OK OK -

@jaime-m-p (Collaborator, Author)

@cmp-nct

It looks like in Python, \n characters are removed if they come after a special token

Not for all special tokens; see the lstrip and rstrip attributes in tokenizer.json:

    {
      "id": 32007,
      "content": "<|end|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": true,
      "normalized": false,
      "special": true
    },

You can see the lstrip and rstrip implementation in #7749.
But since these attributes are not yet stored in the GGUF file, I had to hardcode model names (phi-3 and jina-*).

llama.cpp/llama.cpp, lines 5202 to 5217 in e112b61:

        // set attributes by model/tokenizer name
        if (_contains_any(tokenizer_pre, {"jina-v2-de", "jina-v2-es", "jina-v2-code"})) {
            _set_token_attr("<mask>", LLAMA_TOKEN_ATTR_LSTRIP, true);
        } else if (_contains_any(model_name, {"phi-3", "phi3"})) {
            for (auto id : vocab.cache_special_tokens) {
                _set_tokenid_attr(id, LLAMA_TOKEN_ATTR_RSTRIP, true);
            }
            for (auto token : {"</s>"}) {
                _set_token_attr(token, LLAMA_TOKEN_ATTR_RSTRIP, true);
            }
            for (auto token : {"<unk>", "<s>", "<|endoftext|>"}) {
                _set_token_attr(token, LLAMA_TOKEN_ATTR_RSTRIP, false);
            }
        }
    }
}
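
As an illustration of what the RSTRIP attribute does downstream (the actual handling landed in #7749; this sketch only shows the effect, and the helper name is invented): whitespace at the start of the raw text following an rstrip-flagged special token is consumed before that text is tokenized, which is why the \n after <|end|> disappears in the phi-3 example above.

    #include <cctype>
    #include <string>

    // Sketch: effect of LLAMA_TOKEN_ATTR_RSTRIP on the previous special
    // token -- leading whitespace of the following raw text is stripped
    // before tokenization.
    static std::string strip_after_rstrip_token(std::string raw_text) {
        size_t n = 0;
        while (n < raw_text.size() && std::isspace((unsigned char) raw_text[n])) {
            n++;
        }
        return raw_text.substr(n);
    }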

If llama.cpp is inserting a \n after <|end|>, then there is a bug, or you are using a previous version, or the model name does not match.
Can you please check?

I tried your example and got a different result:
Test:<|user|>This is a test<|end|>\n<|assistant|> // input
Same tokenization: [1, 4321, 29901, 32010, 910, 338, 263, 1243, 32007, 32001].
<s> Test:<|user|> This is a test<|end|><|assistant|> // AutoTokenizer output (skip_special_tokens=False)
<s>Test:<|user|> This is a test<|end|><|assistant|> // Llama.cpp output (special=True)
Test: This is a test // AutoTokenizer output (skip_special_tokens=True)
Test: This is a test // Llama.cpp output (special=False)

@cmp-nct (Contributor) commented Jun 24, 2024

@jaime-m-p
Oh, I didn't know about that behavior. Explains a lot!

  1. The \n was manually added by me, based on the official chat template from Microsoft.
     But as you pointed out: <|end|> has "rstrip" true, so the newline is removed.

  2. Model-name-specific behavior.
     That's great to know. It also explains why it's not working for me!
     I am testing this on a recent xtuner phi3-mini model (llava type), which comes with generic.name == "LLaMA v2".

I'll repeat the tests after fixing those issues and reverting my changes. Given your results, that's promising.

It's a little troublesome that such errors can so easily sneak into a model and are very hard to notice, and even harder to fix without blindly recreating the model from the originals.
Likely the majority of GGUF files on Hugging Face are flawed in some way.

@Harsha-Nori commented Jun 24, 2024

Hi @jaime-m-p and @cmp-nct, really grateful you both are looking into this! I'm traveling without reliable access to a computer at the moment, but wanted to ask if these fixes now keep stability on retokenization with Phi-3 (i.e. the roundtrip of text -> tokens -> text -> tokens results in the same tokens). The constant whitespace insertion on each cycle was causing serious kv-cache reuse issues on our side and I'm really hopeful that this update resolves it!

jaime-m-p added 4 commits June 24, 2024 20:37
Detokenize special tokens.
Replace errors with '\uFFFD' when detokenizing to 'utf-8'.
More edge cases.
Better detokenization results check.
@jaime-m-p (Collaborator, Author) commented Jun 24, 2024

Overall current tokenize and detokenize state.

WPM models (bert-bge, jina-v2-en) are still broken, probably due to the Unicode NFD normalization.

BPE models qwen2, olmo and mpt are probably failing due to the missing Unicode NFC normalization.
See "normalizer": { "type": "NFC" } in the tokenizer.json files.

All BPE and SPM models seem to detokenize properly.

Each cell shows the number of tokenization and detokenization errors (up to 10). An empty cell means 0 errors.
NOTE: There are more failing tests not included in this table.

VOCABS \ TESTS: added_lr_strip apostrophe ascii_lr_strip unicodes vocab_words
baichuan 10 - 10 10 - 10 10 - 10 10 - 10 10 - 10
bert-bge 0 - 10 0 - 10 0 - 10 10 - 10 0 - 10
deepseek-coder
deepseek-llm
falcon 10 - 0 10 - 0
gpt-2
jina-v2-code
jina-v2-de
jina-v2-en 0 - 10 0 - 10 0 - 10 10 - 10 0 - 10
jina-v2-es
llama-bpe
llama-spm
mpt 10 - 0 10 - 0 10 - 0 10 - 0
olmo 10 - 0
phi-2
phi-3
poro-chat
qwen2 10 - 0 10 - 0
refact
smaug-bpe
stablelm2
starcoder

@jaime-m-p (Collaborator, Author)

@Harsha-Nori

Hi @jaime-m-p and @cmp-nct, really grateful you both are looking into this! I'm traveling without reliable access to a computer at the moment, but wanted to ask if these fixes now keep stability on retokenization with Phi-3 (i.e. the roundtrip of text -> tokens -> text -> tokens results in the same tokens). The constant whitespace insertion on each cycle was causing serious kv-cache reuse issues on our side and I'm really hopeful that this update resolves it!

AutoTokenizer does not complete this roundtrip for some models either.

llama-bpe
                 ' \x00z \x07z \x0ez \x15z \x1cz  z !z "z $z %z &z (z )z *z +z ,z -'  # input text
'<|begin_of_text|> \x00z \x07z \x0ez \x15z \x1cz  z!z "z $z %z &z (z )z *z +z,z -'  # AutoTokenizer
'<|begin_of_text|> \x00z \x07z \x0ez \x15z \x1cz  z!z "z $z %z &z (z )z *z +z,z -'  # Llama.cpp
phi-3
    ' \x00z \x07z \x0ez \x15z \x1cz  z !z "z $z %z &z (z )z *z +z ,z -'  # input text
'<s>  \x00z \x07z \x0ez \x15z \x1cz  z !z "z $z %z &z (z )z *z +z ,z -'  # AutoTokenizer
'<s>  \x00z \x07z \x0ez \x15z \x1cz  z !z "z $z %z &z (z )z *z +z ,z -'  # Llama.cpp

llama-bpe removes spaces before some punctuation characters, so re-tokenization differs.
phi-3 seems more consistent; maybe removing the BOS token and the first space is enough.

Probably a few models can achieve this, but information can be lost in tokenization (normalization, lstrip, rstrip, etc.).
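
Stated precisely, the property being asked about is that after the first (possibly lossy) cycle, encode/decode becomes a fixed point. A sketch with hypothetical encode/decode stand-ins -- these names are invented for illustration and are not llama.cpp's real API:

    #include <string>
    #include <vector>

    // Hypothetical stand-ins, declared only to state the property.
    std::vector<int> encode(const std::string & text, bool parse_special);
    std::string      decode(const std::vector<int> & tokens);

    // The first cycle may be lossy (normalization, lstrip/rstrip, BOS), but
    // from then on encode(decode(toks)) must reproduce toks exactly --
    // otherwise the prompt's token prefix keeps shifting and kv-cache reuse
    // breaks, as with phi-3's previously growing space at index 1.
    bool retokenization_stable(const std::string & text) {
        const std::vector<int> t1 = encode(text, /*parse_special=*/true);
        const std::vector<int> t2 = encode(decode(t1), /*parse_special=*/true);
        return t1 == t2;
    }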

@Harsha-Nori

Hmm, great point. I think what I'm really hoping for is eventual stability on the second or third tokenize/detokenize cycles -- before your PR, Phi-3 had the problem of constantly changing the token_id at index 1 (due to growing spaces), which really caused issues.

I think this set of changes is good enough to solve most of our problems :).
