Add display capabilities to tokenizers objects #1542

Open: wants to merge 77 commits into base: main

@ArthurZucker (Collaborator) commented Jun 3, 2024

>>> from tokenizers import Tokenizer
>>> Tokenizer.from_pretrained("ArthurZ/new-t5-base")
Tokenizer(normalizer=normalizers.Sequence([normalizers.Precompiled(), normalizers.Strip(strip_left=false, strip_right=true), normalizers.Replace(pattern=Regex(" {2,}"), content="▁", regex=SysRegex { regex: Regex { raw: 0x1069ca350 } }]), pre_tokenizer=PreTokenizer(pretok=Metaspace(replacement='▁', prepend_scheme="first", split=true)), model=Unigram(vocab={'<pad>': 0, '</s>': 0, '<unk>': 0, '▁': -2.012292861938477, 'X': -2.486478805541992, ...}, unk_id=2, bos_id=32101, eos_id=32102), post_processor=TemplateProcessing(single=Template([Sequence { id: A, type_id: 0 }, SpecialToken { id: "</s>", type_id: 0 }]), pair=Template([Sequence { id: A, type_id: 0 }, SpecialToken { id: "</s>", type_id: 0 }, Sequence { id: B, type_id: 0 }, SpecialToken { id: "</s>", type_id: 0 }])), decoder=Metaspace(replacement='▁', prepend_scheme="first", split=true), added_vocab=AddedVocabulary(added_tokens_map_r={
        0: AddedToken(content="<pad>", single_word=false, lstrip=false, rstrip=false, normalized=false, special=true), 
        1: AddedToken(content="</s>", single_word=false, lstrip=false, rstrip=false, normalized=false, special=true), 
        2: AddedToken(content="<unk>", single_word=false, lstrip=false, rstrip=false, normalized=false, special=true), ...}, encode_special_tokens=false), truncation=None, padding=None)

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

…add-display

fix git suggestion

nit

__repr__ should use Debug?

small updates

Simple lazygit test

@McPatate (Member) left a comment

I'm a bit confused about why Display is implemented three different ways: in some cases you write impl Display for MyStruct by hand, in others you use derive_more::Display with #[derive(Display)] struct MyStruct, and elsewhere you use StructDisplay.

#[serde(untagged)]
pub(crate) enum PyDecoderWrapper {
#[display(fmt = "{}", "_0.as_ref().read().unwrap().inner")]
@McPatate (Member), Jun 10, 2024

Is the display attribute macro native Rust, or does it come from the derive_more crate?

Member

Also, are you sure .unwrap() is the right thing here? Perhaps .unwrap_or_else(some_default_display_fn) would work better.
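
One way to read this suggestion, as a minimal std-only sketch: `RwLock::read` fails only when the lock is poisoned, and `PoisonError::into_inner` still hands back the guard, so `unwrap_or_else` can recover the value for display instead of panicking. `Decoder` and `Wrapper` are hypothetical stand-ins, not the PR's types.

```rust
use std::fmt;
use std::sync::{Arc, RwLock};

// Hypothetical stand-in for the wrapped decoder; not the PR's actual type.
struct Decoder {
    name: String,
}

impl fmt::Display for Decoder {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "Decoder(name={})", self.name)
    }
}

// Hypothetical wrapper around a shared, lock-protected decoder.
struct Wrapper(Arc<RwLock<Decoder>>);

impl fmt::Display for Wrapper {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // A poisoned lock means a writer panicked while holding it. Printing
        // only reads the value, so recover the guard rather than unwrap.
        let guard = self
            .0
            .read()
            .unwrap_or_else(|poisoned| poisoned.into_inner());
        write!(f, "{}", *guard)
    }
}
```

Whether showing possibly half-updated data is acceptable depends on the type; printing a fixed placeholder in the error branch is the more conservative alternative.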

Custom(Arc<RwLock<CustomDecoder>>),
#[display(fmt = "{}", "_0.as_ref().read().unwrap()")]
Member

same here

/// Base class for all models
///
/// The model represents the actual tokenization algorithm. This is the part that
/// will contain and manage the learned vocabulary.
///
/// This class cannot be constructed directly. Please use one of the concrete models.
#[pyclass(module = "tokenizers.models", name = "Model", subclass)]
-#[derive(Clone, Serialize, Deserialize)]
+#[derive(Clone, Serialize, Deserialize, Display)]
+#[display(fmt = "{}", "model.as_ref().read().unwrap()")]
Member

same here

@@ -220,6 +221,12 @@ impl PyModel {
fn get_trainer(&self, py: Python<'_>) -> PyResult<PyObject> {
PyTrainer::from(self.model.read().unwrap().get_trainer()).get_as_subtype(py)
}
fn __str__(&self) -> PyResult<String> {
Ok(format!("{}", self.model.read().unwrap()))
Member

If read() returns a Result, then you can probably convert it to a PyResult here rather than unwrapping it.
If it returns an Option, then perhaps returning a default value rather than unwrapping would be preferable.
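
For reference, `std::sync::RwLock::read` returns a `LockResult`, which is a `Result`, so the first case applies. A minimal std-only sketch of propagating the failure instead of unwrapping follows; `display_locked` is a hypothetical helper, and the `String` error stands in for whatever Python exception the binding would raise (for example via `pyo3::exceptions::PyException::new_err`).

```rust
use std::fmt::Display;
use std::sync::RwLock;

/// Formats the value behind the lock. A poisoned lock becomes an `Err`
/// carrying a message instead of a panic.
fn display_locked<T: Display>(lock: &RwLock<T>) -> Result<String, String> {
    lock.read()
        // The read guard implements Display whenever T does.
        .map(|guard| guard.to_string())
        // PoisonError implements Display, so it can be embedded in a message.
        .map_err(|err| format!("could not read model: {err}"))
}
```

In `__str__` and `__repr__`, the same `map_err` step would let `?` return the error to Python rather than aborting on a poisoned lock.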

Ok(format!("{}", self.model.read().unwrap()))
}
fn __repr__(&self) -> PyResult<String> {
Ok(format!("{}", self.model.read().unwrap()))
Member

Same here

#[pyo3(signature = ())]
#[pyo3(text_signature = "(self)")]
#[getter]
Member

Is this code equivalent?

Comment on lines 63 to 66
_ => unimplemented!(),
}
},
_ => unimplemented!(),
Member

Not sure how to handle errors in macros, but I'd take a look rather than leaving a call to unimplemented!
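
A common pattern for derive macros is to have the expansion helper return a `Result` and let the entry point turn any error into a `compile_error!` invocation, so the user sees a diagnostic instead of a macro panic. The sketch below models that with plain strings and std only; `Shape`, `expand`, and `derive_entry` are hypothetical. In a real proc macro built on `syn`, the equivalent step would be `syn::Error::new_spanned(...).to_compile_error()`.

```rust
// Hypothetical, simplified model of the inputs a derive macro can receive.
enum Shape {
    NamedStruct(Vec<&'static str>),
    TupleStruct(usize),
    Union,
}

/// Produces the generated code as text, or an error message for inputs the
/// derive does not support.
fn expand(shape: &Shape) -> Result<String, String> {
    match shape {
        Shape::NamedStruct(fields) => {
            Ok(format!("/* Display impl for fields: {} */", fields.join(", ")))
        }
        Shape::TupleStruct(count) => {
            Ok(format!("/* Display impl for {count} unnamed fields */"))
        }
        // Report the unsupported case instead of calling unimplemented!().
        Shape::Union => Err("StructDisplay does not support unions".to_string()),
    }
}

/// Entry point: an unsupported input expands to a compile_error! call.
fn derive_entry(shape: &Shape) -> String {
    // `{msg:?}` prints the message as a quoted, escaped string literal.
    expand(shape).unwrap_or_else(|msg| format!("compile_error!({msg:?});"))
}
```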

Comment on lines 3 to 11
use crate::utils::SysRegex;
use serde::{Deserialize, Serialize};

use crate::tokenizer::{
Decoder, Encoding, PostProcessor, PreTokenizedString, PreTokenizer, Result,
SplitDelimiterBehavior,
};
use crate::utils::macro_rules_attribute;
use crate::utils::SysRegex;
use display_derive::StructDisplay;
use serde::{Deserialize, Serialize};

Member

are you using rustfmt?

Fields::Named(fields) => {
// If the struct has named fields
let field_names = fields.named.iter().map(|f| &f.ident);
let field_names2 = field_names.clone();
Member

Suggested change
-let field_names2 = field_names.clone();
+let field_names_clone = field_names.clone();

Member

Also, why do you need to clone?
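
A likely answer, assuming the two copies are consumed in separate passes: a Rust iterator is used up by the first pass, so a second pass needs either a clone of the lazy iterator or a collected `Vec`. A minimal sketch with a hypothetical `labels_and_accessors` helper:

```rust
/// Walks the field names twice: once to build labels, once to build
/// accessor paths.
fn labels_and_accessors(fields: &[&str]) -> (Vec<String>, Vec<String>) {
    let names = fields.iter().copied();
    // Cloning a lazy iterator only copies its cursor, not the data. Without
    // the clone, the second `map` below would use a moved value.
    let names_clone = names.clone();
    let labels: Vec<String> = names.map(|n| format!("{n}=")).collect();
    let accessors: Vec<String> = names_clone.map(|n| format!("self.{n}")).collect();
    (labels, accessors)
}
```

Collecting the names into a `Vec` once and iterating over it by reference would avoid the clone at the cost of one allocation.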

Comment on lines 41 to 44
let mut prefix = (&mut chars).take(100 - 1).collect::<String>();
if chars.next().is_some() {
prefix.push('…');
}
Member

Wasn't that what the ellipse crate was for?

Collaborator Author

yeah but it was too annoying to use 😢

Member

Then remove it from your Cargo.toml file 😉

I think what you wrote is perfectly fine and does not require bringing in the extra crate!
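
For comparison, the same idea as a standalone function, written as a sketch rather than the PR's code. `truncate_with_ellipsis` and `max_chars` are hypothetical names. Unlike the four-line snippet above, this variant keeps a string that fits exactly in `max_chars` characters unchanged instead of replacing its last character with an ellipsis.

```rust
/// Returns at most `max_chars` characters of `text`. When characters are
/// dropped, the last kept position holds an ellipsis.
fn truncate_with_ellipsis(text: &str, max_chars: usize) -> String {
    if max_chars == 0 {
        return String::new();
    }
    let mut chars = text.chars();
    // Take one character fewer than the limit to leave room for the marker.
    let mut prefix: String = chars.by_ref().take(max_chars - 1).collect();
    match (chars.next(), chars.next()) {
        // Exactly one character left: it fits, so keep it.
        (Some(last), None) => prefix.push(last),
        // Two or more left: mark the cut.
        (Some(_), Some(_)) => prefix.push('…'),
        // Nothing left: the text was shorter than the limit.
        _ => {}
    }
    prefix
}
```

Counting `char`s rather than bytes keeps multi-byte tokens such as '▁' intact.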

Collaborator Author

Oh, I thought I had removed it, lol. On it!

3 participants