Dataset Help #54

Open
KeyLKey opened this issue Nov 1, 2023 · 6 comments

Comments

@KeyLKey

KeyLKey commented Nov 1, 2023

It's a great honor to see your masterpiece, but I'm now facing difficulties. Could you provide the nci_entities.json data file? Thank you very much.

@igorcouto

+1 on seeing the dataset and better instructions. I got error messages with everything I tried.

@HamedBabaei
Owner

Dear @KeyLKey, thanks for the comment. I will add more information on how to create the data. Unfortunately, due to the license of the UMLS datasets, we might not be able to share the file itself; however, we can provide the details of how to create one.

@HamedBabaei
Owner

Dear @igorcouto, thanks for the comment. Could you share the error messages with me so that I can check what the issue might be and fix it?
Thanks

@KeyLKey
Author

KeyLKey commented Nov 14, 2023

Dear author, could you tell me which data file to download? Is it the UMLS Metathesaurus Full Subset or the UMLS Semantic Network files? The former is 27.1 GB when decompressed. I am very eager to build a dataset like yours. Thank you very much!

@HamedBabaei
Owner

Hi @KeyLKey, for UMLS you need to download the umls-2022AB-metathesaurus-full.zip file and follow the instructions for creating the datasets for Tasks A, B, and C in the notebook TaskA/notebooks/umls-dataset-preprations_for_TaskABC.ipynb (available in the repository).

You will build datasets for MEDCIN, NCI, and SNOMEDCT_US.
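
For anyone worried about the size of the full download, here is a rough sketch of how rows for just these three sources could be picked out of MRCONSO.RRF. It is not taken from the notebook. It assumes the standard pipe-delimited MRCONSO.RRF layout, in which the source abbreviation (SAB) is the 12th field and the term string (STR) is the 15th; the function names are made up for illustration.

```python
from collections import Counter

SOURCES = {"MEDCIN", "NCI", "SNOMEDCT_US"}
SAB_INDEX = 11  # source abbreviation column (assumed standard MRCONSO.RRF layout)
STR_INDEX = 14  # term string column (same assumption)


def iter_source_terms(lines, sources=SOURCES):
    """Yield (source, term) for rows whose SAB is one of the wanted sources."""
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) > STR_INDEX and fields[SAB_INDEX] in sources:
            yield fields[SAB_INDEX], fields[STR_INDEX]


def count_terms(path):
    """Count matching rows per source, streaming the file line by line."""
    with open(path, encoding="utf-8") as handle:
        return Counter(source for source, _ in iter_source_terms(handle))
```

Calling `count_terms` on the extracted MRCONSO.RRF file gives a quick check that all three sources are present before running the notebook.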

After that, for Task A, you need to run only the last part of TaskA/build_entity_dataset.py, which is as follows:

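    # BaseConfig, dataset_builder, and DataWriter are set up in the earlier part
    # of TaskA/build_entity_dataset.py, so keep that part of the script in place.
    config = BaseConfig(version=3).get_args(kb_name="umls")
    umls_builder = dataset_builder(config=config)
    # build() returns the entities and their statistics, one entry per knowledge base.
    dataset_json, dataset_stats = umls_builder.build()
    for kb in list(dataset_json.keys()):
        # Each knowledge base is written to the paths from its own configuration.
        DataWriter.write_json(data=dataset_json[kb],
                              path=BaseConfig(version=3).get_args(kb_name=kb.lower()).entity_path)
        DataWriter.write_json(data=dataset_stats[kb],
                              path=BaseConfig(version=3).get_args(kb_name=kb.lower()).dataset_stats)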

Check TaskA/configuration/config.py to make sure the paths used to create the dataset are correct.
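
A quick way to catch a wrong path before a long build is to check that the output folders exist. The sketch below is only an illustration: it assumes the config object exposes paths as plain attributes (`entity_path` and `dataset_stats` appear in the snippet above), and `missing_output_dirs` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path
from types import SimpleNamespace


def missing_output_dirs(config, attrs=("entity_path", "dataset_stats")):
    """Return {attribute: folder} for paths whose parent folder does not exist."""
    missing = {}
    for attr in attrs:
        value = getattr(config, attr, None)
        if value is None:
            missing[attr] = None  # attribute is not defined on the config at all
            continue
        parent = Path(value).expanduser().resolve().parent
        if not parent.is_dir():
            missing[attr] = str(parent)
    return missing


if __name__ == "__main__":
    # Stand-in for the real config object; these paths are made up.
    example = SimpleNamespace(
        entity_path="datasets/TaskA/NCI/nci_entities.json",
        dataset_stats="datasets/TaskA/NCI/stats.json",
    )
    for attr, folder in missing_output_dirs(example).items():
        print(f"{attr}: missing folder {folder}")
```

With the real project, the object returned by `BaseConfig(version=3).get_args(kb_name="nci")` could be passed in place of `example`, assuming it behaves like a namespace.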

For Task B, you need to run the following scripts in order (please also check the scripts so that they run only for UMLS):

1. build_hierarchy.py
2. build_datasets.py
3. train_test_split.py

And for Task C, please run only the following scripts:

1. build_datasets.py
2. train_test_split.py
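
The two script lists above can be run in order with a small wrapper like the one below. This is a sketch under assumptions: the folder names `TaskB` and `TaskC` are guessed by analogy with `TaskA`, the scripts are assumed to take no command-line arguments, and they are assumed to be run from inside their own folder.

```python
import subprocess
import sys
from pathlib import Path

# Script order taken from the lists above; the folder names are an assumption.
PIPELINES = {
    "TaskB": ["build_hierarchy.py", "build_datasets.py", "train_test_split.py"],
    "TaskC": ["build_datasets.py", "train_test_split.py"],
}


def plan_commands(task):
    """Return the commands for one task, in the order they should run."""
    return [[sys.executable, script] for script in PIPELINES[task]]


def run_task(task, repo_root="."):
    """Run each script from inside its task folder, stopping at the first failure."""
    task_dir = Path(repo_root) / task
    for command in plan_commands(task):
        print("running:", " ".join(command), "in", task_dir)
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(command, check=True, cwd=task_dir)


# Example: run_task("TaskB") from the repository root, then run_task("TaskC").
```

Stopping at the first failure matters here because each script presumably depends on the output of the one before it.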

I hope this helps, and good luck!

@tage384

tage384 commented Nov 23, 2023

Dear author, I'm trying to build nci_entities.json following your method, but I found that UMLS_entity_types_with_levels.tsv is missing. May I ask what went wrong? Thank you very much!
