Merges cannot handle tokens containing spaces. #909

Draft: Narsil wants to merge 3 commits into main from support_bpe_with_whitespace_tokens

Conversation

@Narsil (Collaborator) commented Feb 16, 2022

This fixes the issue while keeping backward compatibility.
We don't want to merge it blindly.

@nihirv commented Nov 22, 2022

Hi. I found this PR through #566.

How can I use this PR? I wasn't able to see the Rust files in my tokenizers site-packages, and the README on the repo doesn't give instructions for building from source.

@Narsil force-pushed the support_bpe_with_whitespace_tokens branch from 74a4c9e to 0f617cc on November 23, 2022 19:20
@Narsil (Collaborator, Author) commented Nov 23, 2022

Here are the docs for installing from source:

https://huggingface.co/docs/tokenizers/installation#installation-from-sources

I rebased onto main so this PR has all the features developed since. Please report whether it suits your use case; that would strengthen the case for merging. I hesitate to merge features nobody uses, so knowing about real use cases is nice.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@nihirv commented Nov 24, 2022

Thanks for the speedy response, Narsil.

To clarify: this PR allows the merges in our tokenizer.json file to contain tokens with whitespace (even multiple spaces) in them?

I'll play around with it tomorrow and let you know 👍

@Narsil (Collaborator, Author) commented Nov 24, 2022

Yes, exactly. This allows any number of spaces, or any other character, in the merges.
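
For illustration, here is a rough sketch of the kind of merge this makes serializable (the tiny vocab and merges are made up, and saving in the pair-based format assumes this branch is installed; a released tokenizers cannot round-trip a merge whose tokens contain spaces):

    from tokenizers import Tokenizer
    from tokenizers.models import BPE

    # Toy vocab in which some tokens contain a literal space.
    vocab = {"a": 0, " ": 1, "b": 2, " b": 3, "a b": 4}
    # Merges as pairs: the old "left right" string format cannot represent
    # ("a", " b") unambiguously, but a 2-element list can.
    merges = [(" ", "b"), ("a", " b")]

    tokenizer = Tokenizer(BPE(vocab=vocab, merges=merges))
    tokenizer.save("toy-tokenizer.json")

The saved toy-tokenizer.json should then contain each merge as a length-2 list rather than a single space-separated string.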

@nihirv commented Nov 24, 2022

Wow that installation process was painless. I guess there is a first time for everything!

Anyway, I'm still facing the error from the original issue: data did not match any variant of untagged enum ModelWrapper at line 24420 column 3. The merges in tokenizer.json are now length-2 lists of strings instead of single strings, so I'm pretty sure your branch was used successfully.

The error is thrown at this line: fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file), which is located in transformers/tokenization_utils_fast.py.
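
(For reference, the TokenizerFast in transformers/tokenization_utils_fast.py is tokenizers' own Tokenizer, so the failure should be reproducible directly; the path below is just a placeholder for my saved file.)

    from tokenizers import Tokenizer

    # Loading the postprocessed tokenizer.json directly should raise the same
    # "data did not match any variant of untagged enum ModelWrapper" error.
    tokenizer = Tokenizer.from_file("tokenizers/<language>/tokenizer.json")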

@Narsil (Collaborator, Author) commented Nov 24, 2022

Could you share your serialized file somehow? I could take a look.

@nihirv commented Nov 24, 2022

@nihirv commented Nov 28, 2022

Hey man, did you get a chance to take a look?

@Narsil (Collaborator, Author) commented Nov 28, 2022

Thanks for the reminder. I just did; it seems the merges you are referring to in your file contain tokens that are not present in the vocabulary, which makes the merges unsound.

How did you create that file?
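
For reference, here is the kind of check that surfaces the unsound merges (just a quick sketch, not code from this PR; it assumes the pair-based merges this branch writes and the usual tokenizer.json layout):

    import json

    with open("tokenizer.json", "r") as f:
        model = json.load(f)["model"]

    vocab = model["vocab"]
    for left, right in model["merges"]:
        # Every merge needs its two parts and the merged token in the vocab.
        for token in (left, right, left + right):
            if token not in vocab:
                print(f"merge ({left!r}, {right!r}): {token!r} missing from vocab")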

@Narsil (Collaborator, Author) commented Nov 28, 2022

Here is how I'm testing this in Rust to get a better error message:

    // Ensure a standalone BPE "model" JSON can be deserialized.
    #[test]
    fn test_bpe_from_file2() {
        // `test.json` should contain the "model" section of the tokenizer under test.
        let data = std::fs::read_to_string("test.json").unwrap();
        // Deserializing here surfaces a clearer error than the Python bindings do.
        let _bpe: BPE = serde_json::from_str(&data).unwrap();
    }

Just add this to the Rust source code and run cargo test to see how it does (and update test.json so it contains the "model" part of your tokenizer JSON).

Making better error messages all the way into Python is trickier because of the libs we use and how this library is structured (though it would be really cool to have).

@nihirv commented Nov 29, 2022

I'm trying to get BPE to learn the ideal character sequence to split on instead of splitting on whitespace first.

Something like this:

    import os

    from tokenizers import Tokenizer, normalizers
    from tokenizers.models import BPE
    from tokenizers.normalizers import Lowercase
    from tokenizers.trainers import BpeTrainer
    from transformers import PreTrainedTokenizerFast

    tokenizer = Tokenizer(BPE())  # type: ignore
    # The data has been preprocessed to add Ġ to the beginning of every word in the corpus;
    # Lowercase turns Ġ into ġ, so Replace restores the marker afterwards.
    tokenizer.normalizer = normalizers.Sequence([Lowercase(), normalizers.Replace("ġ", "Ġ")])

    # Customize training
    trainer = BpeTrainer(
        vocab_size=5000,
        min_frequency=2,
        show_progress=True,
        special_tokens=["<pad>", "<s>", "</s>", "<unk>", "<mask>"],
    )
    # file_list_for_language and language are defined earlier in the script.
    tokenizer.train(files=file_list_for_language, trainer=trainer)

    tokenizer_path = f"tokenizers/{language}/"
    os.makedirs(tokenizer_path, exist_ok=True)
    new_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
    new_tokenizer.save_pretrained(tokenizer_path)

I'm also doing some postprocessing which modifies the vocabulary directly. This may be where the discrepancy with the merges is coming from:

    import json
    from typing import Dict

    with open(tokenizer_path + "tokenizer.json", "r") as f:
        data = json.load(f)
        model_vocab: Dict[str, int] = data["model"]["vocab"]
        postprocessed_vocab = {}
        for tok, idx in model_vocab.items():
            # Leave the first few ids (the special tokens) untouched.
            if idx < 6:
                postprocessed_vocab[tok] = idx
                continue
            # Strip whitespace and trailing Ġ markers from every other token.
            tok = tok.strip()
            if tok.endswith("Ġ.##"):
                tok = tok[:-4].strip()
            if tok.endswith("Ġ"):
                tok = tok[:-1].strip()

            postprocessed_vocab[tok] = idx
I'm not entirely sure how I can "clean" my vocabulary and make the corresponding changes to the merges list?

@Narsil (Collaborator, Author) commented Nov 29, 2022

Yes, it's your postprocessing that is most likely causing the issue.
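
Purely as an illustration (this is not something the PR provides), the bookkeeping would have to clean the merges with the same rules as the vocab, reusing data and postprocessed_vocab from your snippet above:

    def clean(tok: str) -> str:
        # Same stripping rules as the vocab postprocessing above.
        tok = tok.strip()
        if tok.endswith("Ġ.##"):
            tok = tok[:-4].strip()
        if tok.endswith("Ġ"):
            tok = tok[:-1].strip()
        return tok

    cleaned_merges = []
    for left, right in data["model"]["merges"]:
        left, right = clean(left), clean(right)
        # Keep only merges whose parts and result still exist in the cleaned vocab.
        if all(t in postprocessed_vocab for t in (left, right, left + right)):
            cleaned_merges.append([left, right])
    data["model"]["merges"] = cleaned_merges

Dropping or rewriting merges changes how the model tokenizes, though, so whether the result still behaves the way you want is a separate question.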

@ashutoshbsathe commented

Hi, could this please be merged as a separate (de)serialization format? Several issues over the years have run into this problem and have used this PR to fix it. Keeping it as a separate serialization format would be fully backwards compatible and would explicitly alert people to use the different functions to (de)serialize the tokenizer.

@ArthurZucker (Collaborator) commented

Hey @ashutoshbsathe, if there are more issues that are not linked, or any particular use cases, could you link them here? I'll see what I can do!

@ArthurZucker reopened this Jun 11, 2024
@github-actions bot removed the Stale label Jun 12, 2024