This repository has been archived by the owner on Jul 4, 2023. It is now read-only.

Add GLUE datasets #26

Open
PetrochukM opened this issue Apr 27, 2018 · 7 comments
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed

Comments

@PetrochukM
Owner

GLUE datasets are standard for evaluating NLU tasks.

In pursuit of this objective, we introduce the General Language Understanding Evaluation benchmark
(GLUE), a tool for evaluating and analyzing the performance of models across a diverse range of existing NLU tasks.

@PetrochukM PetrochukM added enhancement New feature or request help wanted Extra attention is needed good first issue Good for newcomers labels Apr 27, 2018
@PattynR

PattynR commented Nov 9, 2018

Hi, I am a Belgian student in computer engineering, and I am following an introductory course on open source. One of my goals this semester is to contribute to a project. My master's thesis will be related to NLP, which is why this project interests me. Is there a way I could help fix this issue (or maybe another issue related to this project)?

@PetrochukM
Owner Author

Hi There!

Yeah, please fix this issue! GLUE datasets are a popular suite of datasets for evaluating NLP models. It'd be nice if there was support for those datasets. This issue should be an easy one to get started with.

Recently, I was in Belgium for EMNLP 2018, one of the best NLP conferences in the world.

@PattynR

PattynR commented Nov 18, 2018

Hey, too bad I missed EMNLP! This is my first year working on NLP, and I had never heard of those conferences. I hope I'll be able to go next year.
About the issue, could you please confirm that my job is to add a new file into the torchnlp/datasets folder? A file that would be named "glue.py". I guess this is what I have to do, but I would prefer to be completely sure!

@PetrochukM
Owner Author

Yeah that'd work!

@PattynR

PattynR commented Dec 8, 2018

Hi,
I'm almost done. For the moment it works for all the GLUE datasets except QQP and SNLI. There is an issue with those files that I don't know how to handle: when I load the QQP and SNLI datasets, some lines in the files themselves don't have the right number of fields. Here is an example to illustrate what I mean.

The first line of each downloaded file lists the names of the features in the tsv file. In the 'train.tsv' file of SNLI, for example, there should be 11 features per line. However, a lot of lines (38,656 in total) contain more than 10 tabs, and so more than 11 features.

For the moment I decided not to add those lines to the Dataset object, but I know this is not what should be done. I've searched the internet for an explanation of those lines, but there is not a lot of documentation about QQP and SNLI.
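
The check described above can be sketched roughly as follows. This is a minimal illustration, not the code from the eventual contribution: it assumes a tab-separated file whose first line is the header, and the function name `split_rows_by_field_count` and the sample rows are hypothetical.

```python
import csv
import io


def split_rows_by_field_count(tsv_text):
    """Split TSV rows into those matching the header width and those that do not.

    Hypothetical helper for illustration only.
    """
    # QUOTE_NONE keeps quote characters inside sentences as literal text
    # instead of letting the csv module interpret them as field quoting.
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    good_rows, bad_rows = [], []
    for row in reader:
        if len(row) == len(header):
            good_rows.append(row)
        else:
            # Kept for inspection rather than silently dropped.
            bad_rows.append(row)
    return header, good_rows, bad_rows


# Made-up sample: the second data row contains one extra tab.
sample = (
    "id\tsentence1\tsentence2\tlabel\n"
    "0\tA man.\tA person.\tentailment\n"
    "1\tA dog\truns.\tAn animal.\tneutral\n"
)
header, good, bad = split_rows_by_field_count(sample)
print(len(good), "well-formed rows;", len(bad), "rows with an unexpected field count")
```

Collecting the mismatched rows in a separate list, instead of discarding them, makes it possible to look at a few of them and decide whether the extra tabs belong to a sentence or to a separate column.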

So do you maybe know what I should do? Or should I add my file to the project and create a new issue? Someone who has already worked with those datasets should be able to fix it easily.

Thanks.

@PetrochukM
Owner Author

Thanks for your attempt at contributing this function: #60 :)

@karish-grover

Hey! I want to give this a try. Is there any way I can still do it? It seems like it's too late to contribute to this project.

3 participants