Special Tokens handling seems to be incorrect (at least in my scenario where I'm creating Command R+ tiktoken file from specification) #7160

Open
alkampfergit opened this issue May 27, 2024 · 0 comments
Labels
untriaged New issue has not been triaged

Windows 11, .NET 8, library version 0.22.0-preview.24271.1

Describe the bug
I'm trying to use the Tiktoken class to implement the Cohere Command R+ tokenizer from the vocabulary file that Cohere publishes at https://storage.googleapis.com/cohere-public/tokenizers/command-r-plus.json

To Reproduce
You can find a notebook here: https://github.com/alkampfergit/ai-playground/blob/develop/src/python/langchainVarious/Tokenization/dotnetcohere.dib. In short, I downloaded the file, extracted the vocabulary node and created a .tiktoken file from it, then parsed the JSON file to grab the special tokens.
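The conversion steps above can be sketched roughly as follows. This is not the code from the notebook, only a minimal illustration under several assumptions: the specification follows the Hugging Face `tokenizer.json` layout (`model.vocab` mapping token strings to ids, `added_tokens` entries with `id`, `content` and `special`), the vocabulary strings use the GPT-2 byte-to-unicode table, and the target file uses one `base64(token bytes) rank` pair per line. The helper names `load_spec`, `special_tokens` and `tiktoken_lines` are hypothetical.

```python
import base64
import json


def bytes_to_unicode():
    """GPT-2 style table mapping each byte value to a printable character."""
    bs = (list(range(0x21, 0x7F))      # '!' .. '~'
          + list(range(0xA1, 0xAD))    # U+00A1 .. U+00AC
          + list(range(0xAE, 0x100)))  # U+00AE .. U+00FF
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Non-printable bytes are shifted to code points from 256 upwards.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


# Inverse table: printable character -> original byte value.
UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def load_spec(path):
    """Load the downloaded tokenizer specification (a JSON file)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def special_tokens(spec):
    """Collect entries flagged as special (assumed 'added_tokens' layout)."""
    return {t["content"]: t["id"]
            for t in spec.get("added_tokens", []) if t.get("special")}


def tiktoken_lines(spec):
    """Build '<base64 token bytes> <rank>' lines from the vocabulary node."""
    specials = special_tokens(spec)
    lines = []
    vocab = spec["model"]["vocab"]
    for token, rank in sorted(vocab.items(), key=lambda kv: kv[1]):
        if token in specials:
            # Assumption: special tokens are passed separately, not as ranks.
            continue
        raw = bytes(UNICODE_TO_BYTE[ch] for ch in token)
        lines.append(base64.b64encode(raw).decode("ascii") + " " + str(rank))
    return lines
```

Writing `"\n".join(tiktoken_lines(load_spec("command-r-plus.json")))` to a file would give the vocabulary file, while `special_tokens(...)` gives the dictionary to hand to the tokenizer separately. If the vocabulary is not byte-level encoded in this way, the byte conversion step would need to change.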

Expected behavior
After parsing the Command R+ tokenizer specification, tokenizing a special token on its own, such as <|YES_TOKEN|>, correctly yields a single token (upper part of the picture). However, when the same token appears inside a sentence, such as good<|YES_TOKEN|>good, the tokenizer does not recognize it as a special token (lower part of the picture). I would expect it to be recognized as a single token in both cases.
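To make the expectation concrete, here is a small sketch of the splitting I would expect to happen before ordinary encoding. It is only an illustration of the expected result, not the library's implementation; the function name `split_on_special` and the token id used in the usage note are made up for the example.

```python
import re


def split_on_special(text, special_tokens):
    """Split text into (piece, is_special) pairs around special tokens."""
    if not special_tokens:
        return [(text, False)] if text else []
    # Longest tokens first so that overlapping prefixes cannot shadow them.
    ordered = sorted(special_tokens, key=len, reverse=True)
    pattern = re.compile("(" + "|".join(re.escape(t) for t in ordered) + ")")
    pieces = []
    # The capturing group makes re.split keep the matched special tokens.
    for part in pattern.split(text):
        if part:
            pieces.append((part, part in special_tokens))
    return pieces
```

With a placeholder id, `split_on_special("good<|YES_TOKEN|>good", {"<|YES_TOKEN|>": 1})` separates the text into `good`, the special token, and `good`, so only the two ordinary pieces would need regular tokenization.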

Since I created the .tiktoken file from the specification myself, it is possible that I did something wrong, but other tokens seem to be recognized correctly.

Screenshots, Code, Sample Projects
image

dotnet-policy-service bot added the "untriaged" label (New issue has not been triaged) on May 27, 2024