
Error calling tokenizer.get_vocab() (Codegen2.5) #85

Open
ShushanArakelyan opened this issue Oct 23, 2023 · 1 comment

Comments


ShushanArakelyan commented Oct 23, 2023

I wanted to check whether Codegen2.5 uses the same vocabulary as Codegen2 (a question to the authors: does it?), and noticed that calling .get_vocab() on the tokenizer raises an error.

How to reproduce:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen25-7b-multi", trust_remote_code=True)
tokenizer.get_vocab()
```

The expected output would be a dictionary mapping tokens to IDs. The output I get instead is:

```
UnicodeDecodeError                        Traceback (most recent call last)
Cell In[18], line 1
----> 1 tokenizer.get_vocab()

File /home/shushan/.cache/huggingface/modules/transformers_modules/Salesforce/codegen25-7b-multi/d4dc9dd90e8b23d5411e6d970e3a11e88dc5c2bc/tokenization_codegen25.py:153, in CodeGen25Tokenizer.get_vocab(self)
    151 def get_vocab(self):
    152     """Returns vocab as a dict"""
--> 153     vocab = {self._convert_id_to_token(i): i for i in range(self.vocab_size)}
    154     return vocab

File /home/shushan/.cache/huggingface/modules/transformers_modules/Salesforce/codegen25-7b-multi/d4dc9dd90e8b23d5411e6d970e3a11e88dc5c2bc/tokenization_codegen25.py:153, in <dictcomp>(.0)
    151 def get_vocab(self):
    152     """Returns vocab as a dict"""
--> 153     vocab = {self._convert_id_to_token(i): i for i in range(self.vocab_size)}
    154     return vocab

File /home/shushan/.cache/huggingface/modules/transformers_modules/Salesforce/codegen25-7b-multi/d4dc9dd90e8b23d5411e6d970e3a11e88dc5c2bc/tokenization_codegen25.py:169, in CodeGen25Tokenizer._convert_id_to_token(self, index)
    167 def _convert_id_to_token(self, index):
    168     """Converts an index (integer) in a token (str) using the vocab."""
--> 169     return self.encoder.decode_single_token_bytes(index).decode("utf-8")

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa1 in position 0: invalid start byte
```
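The failure mode is inherent to byte-level BPE tokenizers: a single token's bytes need not form valid UTF-8 on their own, so strict per-token decoding can fail. A minimal sketch using raw bytes rather than the actual tokenizer (the byte value 0xA1 is taken from the traceback above):

```python
# 0xA1 is a UTF-8 continuation byte; it cannot start a character,
# so strict decoding raises UnicodeDecodeError.
token_bytes = b"\xa1"

try:
    token_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    print(f"strict decode fails: {exc}")

# Lossless alternatives: escape the offending bytes, or use latin-1,
# which maps every byte 0x00-0xFF to the code point of the same value.
print(token_bytes.decode("utf-8", errors="backslashreplace"))  # '\xa1' escaped
print(token_bytes.decode("latin-1"))                           # one char per byte
```

Both alternatives are reversible (the latin-1 string can be re-encoded back to the original bytes), which matters if the vocab dict is later used to reconstruct token bytes.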


nedix commented Nov 20, 2023

I have reported the same issue on HF.

Seeing that this project has moved to the Llama2 architecture, I have been attempting to convert this model to LLAMA GGML format.

I am currently at a dead end because the get_vocab and save_vocabulary methods in tokenization_codegen25.py are inoperable. When invoking get_vocab, the problem is that some vocabulary entries are byte sequences that are not valid UTF-8, so the hard-coded utf-8 decode fails.

These could be solutions:
a. Change the decoding on line 169 of tokenization_codegen25.py from utf-8 to latin-1
b. In the next version of this model, filter non-UTF-8 tokens out of the vocabulary
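Option (a) can be sketched as a small patch to the failing method. The helper below is a hypothetical stand-in for CodeGen25Tokenizer._convert_id_to_token, assuming latin-1 is used only as a byte-preserving fallback when strict UTF-8 decoding fails, so tokens that are valid UTF-8 keep their current representation:

```python
def convert_token_bytes(token_bytes: bytes) -> str:
    """Decode one token's bytes, falling back to latin-1.

    Hypothetical stand-in for _convert_id_to_token. latin-1 maps each
    byte 0x00-0xFF to the code point of the same value, so the fallback
    never raises and is reversible via s.encode("latin-1").
    """
    try:
        return token_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return token_bytes.decode("latin-1")

print(convert_token_bytes(b"hello"))  # valid UTF-8 passes through unchanged
print(convert_token_bytes(b"\xa1"))   # invalid UTF-8 falls back to latin-1
```

One caveat with this approach: a token whose bytes happen to be valid UTF-8 and a different token whose latin-1 fallback yields the same string could collide in the vocab dict, so option (b), filtering at vocab-build time, may be the cleaner long-term fix.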
