Update numbers in retrieval statistics #988
@KennethEnevoldsen @isaac-chung I replaced one of the keys with the full information. How do we want to break this up to fit in the existing schema?
I think this looks very good. We can def. keep this as is.
I think I got every retrieval dataset except for MSMarcoV2. My machine unfortunately died every time I tried to calculate it; apparently it requires a lot of RAM. Edit: the failing tests appear to be caused by a pydantic error. I am fairly new to pydantic and its error messages confuse me a bit -- @isaac-chung do you have any ideas on what is wrong?
@orionw the pydantic error seems to be related to the presence of |
Ah thank you! I tried to delete all of them but clearly missed some. I'll update it. EDIT: you're fast, you already did - thanks!
I'll try to run |
Thanks @isaac-chung!! Good luck, it takes quite a while! TBH if neither of us can load it, perhaps we need a more efficient dataset loader or simply to remove MSMarcoV2 from our list. My machine had quite a lot of RAM so it's pretty inaccessible -- looking at the specs on IR datasets it has 138 million passages!!
Wow!
It already got OOM-killed. Let's merge this and consider our options with MSMarcoV2.
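For reference on the memory problem discussed above, here is a minimal sketch of how length statistics could be computed in a single streaming pass, so that a corpus on the scale of MSMarcoV2 never has to sit in RAM at once. This is not the loader used in this repository: the function names, the returned keys, and the JSON Lines layout with a `text` field are all assumptions made for illustration.

```python
import json
from typing import Dict, Iterable, Iterator


def streaming_length_stats(texts: Iterable[str]) -> Dict[str, float]:
    """Count texts and average their character length in one pass.

    Only two running totals are kept, so memory use does not grow
    with the number of texts. The key names are hypothetical.
    """
    count = 0
    total_chars = 0
    for text in texts:
        count += 1
        total_chars += len(text)
    average = total_chars / count if count else 0.0
    return {"num_documents": count, "average_document_length": average}


def iter_passages(path: str) -> Iterator[str]:
    """Yield one passage at a time from a JSON Lines file.

    Assumes each non-empty line is a JSON object with a "text" field;
    this layout is an assumption, not the actual dataset format.
    """
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)["text"]


if __name__ == "__main__":
    # Small in-memory example; a real run would pass iter_passages(path).
    print(streaming_length_stats(["ab", "abcd"]))
```

Because `streaming_length_stats` accepts any iterable, the same function works with a generator over a file on disk or with a dataset opened in streaming mode.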
Update the retrieval statistics to match the new metadata processing.
Checklist
make test
make lint