[Question]: Extending NER tags of Hunflair #3447

Open
skywalker2202 opened this issue Apr 26, 2024 · 0 comments
Labels
question Further information is requested


skywalker2202 commented Apr 26, 2024

Question

I want to fine-tune the hunflair-gene model and extend the tag set of the original model. The label dictionary of hunflair-gene contains the following items: ['<unk>', 'O', 'S-Gene', 'B-Gene', 'I-Gene', 'E-Gene', '<START>', '<STOP>'].

I load the model and inspect its label dictionary as follows:

from flair.models import SequenceTagger

previous_tagger = SequenceTagger.load("hunflair-gene")
previous_tag_dictionary = previous_tagger.label_dictionary
previous_tag_dictionary.get_items()

which outputs ['<unk>', 'O', 'S-Gene', 'B-Gene', 'I-Gene', 'E-Gene', '<START>', '<STOP>'].

However, calling `previous_tag_dictionary.span_labels()` raises `AttributeError: 'Dictionary' object has no attribute 'span_labels'`.
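The span-level label behind these BIOES tags can also be recovered by hand. The following is a minimal sketch in plain Python, not a flair API: the helper name `span_labels_from_bioes` is hypothetical, and it assumes every item is either a special entry (`O`, `<unk>`, `<START>`, `<STOP>`) or a label with a `B-`/`I-`/`E-`/`S-` prefix.

```python
def span_labels_from_bioes(tags):
    """Collapse BIOES-prefixed tags (e.g. 'S-Gene') to span labels (e.g. 'Gene').

    Hypothetical helper for illustration; not part of flair.
    """
    special = {"O", "<unk>", "<START>", "<STOP>"}
    labels = []
    for tag in tags:
        if tag in special:
            continue
        # Drop a leading B-/I-/E-/S- prefix if one is present.
        prefix, sep, rest = tag.partition("-")
        label = rest if sep and prefix in {"B", "I", "E", "S"} else tag
        if label not in labels:
            labels.append(label)
    return labels


# Items reported by previous_tag_dictionary.get_items() above.
old_items = ['<unk>', 'O', 'S-Gene', 'B-Gene', 'I-Gene', 'E-Gene', '<START>', '<STOP>']
print(span_labels_from_bioes(old_items))  # ['Gene']
```

For the dictionary above this yields a single span label, `Gene`.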

My annotated corpus contains two tags, LIG and REC. I converted it to a column corpus and created a new tag dictionary from it:

from flair.datasets import ColumnCorpus

# config["data_folder"] points to the folder holding train.txt, val.txt and test.txt
columns = {0: 'text', 1: 'ner'}
corpus = ColumnCorpus(config["data_folder"], columns, train_file='train.txt', dev_file='val.txt', test_file="test.txt")
new_tag_dictionary = corpus.make_label_dictionary(label_type='ner', add_unk=False)
new_tag_dictionary.get_items()

Which outputs

`2024-04-26 16:16:18,169 Dictionary created for label 'ner' with 2 values: LIG (seen 719 times), REC (seen 296 times)

['LIG', 'REC']
`
I want to fine-tune hunflair-gene on the new dataset. As I understand it, I need a new tag dictionary that covers both the old and the new tags, so I tried the following:

for old_tag in previous_tag_dictionary.get_items():
    new_tag_dictionary.add_item(str(old_tag))

print(f"Updated tag dictionary : {new_tag_dictionary}")

It outputs

Updated tag dictionary : Dictionary with 10 tags: LIG, REC, <unk>, O, S-Gene, B-Gene, I-Gene, E-Gene, <START>, <STOP>

However, when I do

tagger_new = SequenceTagger(
    hidden_size=256,
    embeddings=previous_tagger.embeddings,
    tag_dictionary=new_tag_dictionary,
    tag_type='ner',
)

it outputs

2024-04-26 16:16:31,545 SequenceTagger predicts: Dictionary with 37 tags: O, S-LIG, B-LIG, E-LIG, I-LIG, S-REC, B-REC, E-REC, I-REC, S-O, B-O, E-O, I-O, S-S-Gene, B-S-Gene, E-S-Gene, I-S-Gene, S-B-Gene, B-B-Gene, E-B-Gene, I-B-Gene, S-I-Gene, B-I-Gene, E-I-Gene, I-I-Gene, S-E-Gene, B-E-Gene, E-E-Gene, I-E-Gene, S-<START>, B-<START>, E-<START>, I-<START>, S-<STOP>, B-<STOP>, E-<STOP>, I-<STOP>
These are too many tags. Any help will be appreciated.
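The count of 37 is consistent with every item of the merged dictionary being treated as a span label and expanded into `S-`/`B-`/`E-`/`I-` variants, with `<unk>` skipped and a single `O` added. The sketch below reproduces that arithmetic in plain Python. It is an assumption inferred from the logged output, not code taken from flair, and `expand_to_bioes` is a hypothetical name.

```python
def expand_to_bioes(span_labels):
    """Expand each label into S-/B-/E-/I- variants, plus a single 'O'.

    Hypothetical helper that mimics the logged tag list; not part of flair.
    """
    tags = ["O"]
    for label in span_labels:
        if label == "<unk>":
            # Assumption: the unknown item is not expanded (it is absent from the log).
            continue
        for prefix in ("S", "B", "E", "I"):
            tag = f"{prefix}-{label}"
            if tag not in tags:
                tags.append(tag)
    return tags


# The merged dictionary: already-prefixed tags are expanded a second time.
merged = ['LIG', 'REC', '<unk>', 'O', 'S-Gene', 'B-Gene', 'I-Gene', 'E-Gene', '<START>', '<STOP>']
print(len(expand_to_bioes(merged)))  # 37

# Span-level labels only: one 'O' plus four variants per label.
print(len(expand_to_bioes(['LIG', 'REC', 'Gene'])))  # 13
```

Under this assumption, a dictionary holding only span-level labels (LIG, REC, Gene) would expand to 13 tags rather than 37. Whether the pretrained weights can be reused with the resized tag set is a separate question that this sketch does not address.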
