Environment: Windows 11, .NET 8, library version 0.22.0-preview.24271.1
Describe the bug
I'm trying to use the Tiktoken class to implement the Cohere Command R+ tokenizer from the vocabulary file that Cohere publishes at https://storage.googleapis.com/cohere-public/tokenizers/command-r-plus.json
To Reproduce
You can find a notebook here: https://github.com/alkampfergit/ai-playground/blob/develop/src/python/langchainVarious/Tokenization/dotnetcohere.dib. In short, I downloaded the file, extracted the vocabulary node to create a .tiktoken file, then parsed the JSON file to collect the special tokens.
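For readers who do not want to open the notebook, here is a minimal Python sketch of that conversion step. It is not the notebook's code. It assumes the JSON follows the Hugging Face `tokenizer.json` layout (`model.vocab` plus an `added_tokens` list) and that vocabulary keys use the GPT-2 style byte-to-character table; both are assumptions about this particular file. The function names `bytes_to_unicode` and `to_tiktoken` are illustrative.

```python
import base64


def bytes_to_unicode():
    """Build the GPT-2 style table mapping each byte value to a printable character."""
    # Bytes that are already printable map to themselves.
    bs = (list(range(0x21, 0x7F))      # '!' .. '~'
          + list(range(0xA1, 0xAD))    # U+00A1 .. U+00AC
          + list(range(0xAE, 0x100)))  # U+00AE .. U+00FF
    cs = bs[:]
    n = 0
    # Every remaining byte is shifted into the range starting at U+0100.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}


def to_tiktoken(spec):
    """Split a parsed tokenizer.json dict into .tiktoken lines and special tokens.

    Assumption: every ordinary vocabulary key is made only of characters
    from the byte table above; a KeyError means that does not hold.
    """
    char_to_byte = {ch: b for b, ch in bytes_to_unicode().items()}
    special = {t["content"]: t["id"]
               for t in spec.get("added_tokens", []) if t.get("special")}
    lines = []
    for token, rank in sorted(spec["model"]["vocab"].items(), key=lambda kv: kv[1]):
        if token in special:
            continue  # special tokens are passed to the tokenizer separately
        raw = bytes(char_to_byte[ch] for ch in token)
        # One .tiktoken line: base64 of the raw token bytes, a space, the rank.
        lines.append(f"{base64.b64encode(raw).decode('ascii')} {rank}")
    return lines, special
```

In real use, `spec` would come from `json.load` on the downloaded file, and the returned lines would be written to the .tiktoken file one per line.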
Expected behavior
After parsing the Command R+ tokenizer specification, tokenizing a special token on its own, such as <|YES_TOKEN|>, correctly yields a single token (upper part of the picture). When the same token appears inside a sentence, such as good<|YES_TOKEN|>good, the tokenizer does not recognize it as a special token (lower part of the picture). I would expect it to be returned as a single token in both cases.
Since I created the .tiktoken file from the specification myself, it is possible that I did something wrong, but the other tokens seem to be recognized correctly.
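To make the expected behavior concrete, here is a small Python sketch of the splitting I have in mind: the text is first cut at every special token, and only the ordinary pieces would then go through normal BPE encoding. This only illustrates the expectation. It does not describe how the library works internally, and `split_on_special` is a hypothetical helper name.

```python
import re


def split_on_special(text, special_tokens):
    """Cut text into (piece, is_special) pairs at every special token."""
    if not special_tokens:
        return [(text, False)]
    # Longest names first, so a token that is a prefix of another cannot win.
    names = sorted(special_tokens, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(name) for name in names) + ")"
    pieces = []
    # The capturing group makes re.split keep the matched special tokens.
    for part in re.split(pattern, text):
        if part:
            pieces.append((part, part in special_tokens))
    return pieces


pieces = split_on_special("good<|YES_TOKEN|>good", {"<|YES_TOKEN|>"})
# pieces == [("good", False), ("<|YES_TOKEN|>", True), ("good", False)]
```

With this splitting, `<|YES_TOKEN|>` would map to its single special-token ID whether it stands alone or sits between other characters.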
Screenshots, Code, Sample Projects