
feat: support custom regexes for GPT pre-tokenizer #1462

Open
wants to merge 1 commit into main

Conversation

@gcampax commented Feb 28, 2024

This is needed to properly support GPT-3.5/GPT-4 models, which changed the pre-tokenization regex compared to GPT-2.

Existing tokenizer files are not affected. New tokenizer files can be created that copy the new regex from the tiktoken sources.

I'm new to Rust, so the code probably doesn't look very idiomatic. Happy to adjust it.
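
For reference, a rough side-by-side of the two split patterns as published in the tiktoken sources (copied here for illustration only; the exact strings should be verified against the current tiktoken release):

```python
# GPT-2 (r50k/p50k) split pattern vs. the cl100k_base pattern used by GPT-3.5/GPT-4.
# Both rely on \p{L}/\p{N} Unicode classes and a lookahead, so they need a regex engine
# that supports those (e.g. Oniguruma, as used by the tokenizers crate), not Python's re.
GPT2_PAT = (
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+|"""
    r""" ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)
CL100K_PAT = (
    r"""(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}|"""
    r""" ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"""
)
```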

@Narsil (Collaborator) commented Mar 11, 2024

There's no need for that; just set `use_regex: false` and apply the regex externally (as another pre-tokenizer, for instance) if you want.
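
For illustration, a minimal sketch of that suggestion using the Python bindings; the cl100k_base pattern and the example string are assumptions, not part of this PR:

```python
from tokenizers import Regex
from tokenizers.pre_tokenizers import ByteLevel, Sequence, Split

# cl100k_base split pattern from the tiktoken sources (repeated here for self-containment).
CL100K_PAT = r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"

# Split on the custom regex first, then let ByteLevel only do the byte-to-unicode
# mapping (use_regex=False disables its built-in GPT-2 regex).
pre_tokenizer = Sequence([
    Split(Regex(CL100K_PAT), behavior="isolated", invert=False),
    ByteLevel(add_prefix_space=False, use_regex=False),
])

print(pre_tokenizer.pre_tokenize_str("Hello    world 1234!"))
```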

Narsil closed this Mar 11, 2024
@gcampax (Author) commented Mar 11, 2024

I'm sorry, but this is not correct. `use_regex: false` does not apply in this context: you do need to apply the regular expression. Applying the regex externally doesn't work either: token splitting by regex is done inside the pre-tokenizer, and I don't know how you would apply it externally.
It's also quite inconvenient to have special-case external logic for OpenAI tokenizers, instead of being able to specify a JSON file that actually works.

The PR is not that big, I would kindly ask you to reconsider.

@ArthurZucker (Collaborator) commented

Hey! I am down to reconsider. I think what Narsil meant is that you can have a sequence of pre-tokenizers, to first apply the regex and then ByteLevel (this would be the "external" part). But to be honest, since there is a regex inside, why waste it!
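
For completeness, such a sequence is also serialized into a single tokenizer.json, so no special-case external logic is needed at load time. A minimal sketch, assuming an empty placeholder BPE model rather than a real converted tiktoken vocabulary:

```python
from tokenizers import Regex, Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel, Sequence, Split

CL100K_PAT = r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"

# Placeholder model; a real setup would load the converted tiktoken vocab and merges.
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = Sequence([
    Split(Regex(CL100K_PAT), behavior="isolated", invert=False),
    ByteLevel(add_prefix_space=False, use_regex=False),
])

tokenizer.save("tokenizer.json")  # the custom Split pattern is stored in the JSON
reloaded = Tokenizer.from_file("tokenizer.json")
```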

ArthurZucker reopened this Jun 11, 2024
@ArthurZucker (Collaborator) commented

@gcampax do you want to rebase and I'll review?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
