Update numbers in retrieval statistics #988

orionw · 2024-06-25T22:33:16Z

Update the retrieval statistics to match the new metadata processing.

Checklist

Run tests locally to make sure nothing is broken using `make test`.
Run the formatter to format the code using `make lint`.

orionw · 2024-06-25T22:34:02Z

@KennethEnevoldsen @isaac-chung I replaced one of the keys with the full information.

How do we want to break this up to fit in the existing schema?

KennethEnevoldsen

I think this looks very good. We can def. keep this as is.

mteb/tasks/Retrieval/code/CodeSearchNetRetrieval.py

orionw · 2024-06-26T17:36:03Z

I think I got every retrieval dataset except for MSMarcoV2. My machine unfortunately died every time I tried to calculate it; apparently it requires a lot of RAM.

Edit: the failing tests appear to be caused by a pydantic error. I am fairly new to pydantic and its error messages confuse me a bit -- @isaac-chung do you have any ideas on what is wrong?

isaac-chung · 2024-06-26T18:06:22Z

@orionw the pydantic error seems to be related to the presence of "task_name" in some dicts, e.g. mteb/tasks/Retrieval/dan/TV2Nordretrieval.py. The validation checks for a dict of {str: float} (for other tasks) or {str: {str: dict}} (for this change). The extra "task_name" key has a string value, which fits neither shape, so validation fails.
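
To illustrate why a stray "task_name" entry trips the check, here is a rough plain-Python sketch of the two accepted shapes. It is not mteb's actual pydantic validator; the function name `is_valid_stats` and all keys in the sample dicts are made up for the example.

```python
from numbers import Real


def is_valid_stats(stats):
    """Hypothetical check: accept {str: float} or {str: {str: dict}}."""
    if not isinstance(stats, dict):
        return False
    if not all(isinstance(key, str) for key in stats):
        return False
    # Shape 1: every value is a plain number (bool is excluded on purpose).
    if all(isinstance(v, Real) and not isinstance(v, bool) for v in stats.values()):
        return True
    # Shape 2: every value is a dict whose own values are dicts.
    return all(
        isinstance(v, dict)
        and all(isinstance(k, str) and isinstance(inner, dict) for k, inner in v.items())
        for v in stats.values()
    )


# Sample inputs; the key names are placeholders, not real mteb metadata keys.
flat = {"test": 123.4}
nested = {"test": {"some_field": {"some_stat": 1.0}}}
with_extra = {"task_name": "SomeTask", "test": {"some_field": {"some_stat": 1.0}}}

print(is_valid_stats(flat))        # first shape
print(is_valid_stats(nested))      # second shape
print(is_valid_stats(with_extra))  # the string value under "task_name" fits neither
```

The string stored under "task_name" is neither a number nor a nested dict, so the whole dict is rejected even though the rest of it is well formed.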

orionw · 2024-06-26T18:18:03Z

Ah thank you! I tried to delete all of them but clearly missed some. I'll update it. EDIT: you're fast, you already did - thanks!

isaac-chung · 2024-06-26T18:21:12Z

I'll try to run calculate_metadata_metrics for MSMARCOv2 on my machine.
Otherwise this is good to merge :D

orionw · 2024-06-26T18:32:25Z

Thanks @isaac-chung!! Good luck, it takes quite a while!

TBH if neither of us can load it, perhaps we need a more efficient dataset loader, or we should simply remove MSMarcoV2 from our list. My machine had quite a lot of RAM, so the dataset is pretty inaccessible. Looking at the specs on IR datasets, it has 138 million passages!!
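
One possible direction for a "more efficient loader" is to compute the statistics in a single streaming pass, so the corpus never has to sit in memory at once. The sketch below only shows that idea; it is not how calculate_metadata_metrics works, and `streaming_length_stats`, `fake_corpus`, and the returned key names are hypothetical.

```python
def streaming_length_stats(texts):
    """Count items and average character length in one pass.

    `texts` can be any iterable, including a generator, so only one
    item needs to be held in memory at a time.
    """
    count = 0
    total_chars = 0
    for text in texts:
        count += 1
        total_chars += len(text)
    average = total_chars / count if count else 0.0
    return {"num_items": count, "average_length": average}


def fake_corpus(n):
    """Stand-in for a lazily read corpus; yields n short passages."""
    for i in range(n):
        yield "passage " + str(i)


stats = streaming_length_stats(fake_corpus(3))
print(stats)
```

Whether this helps in practice depends on whether the underlying dataset can be iterated lazily, which this sketch simply assumes.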

isaac-chung · 2024-06-26T19:18:55Z

Wow!
It seems to be running still. I'll give it till the morning (GMT+3) and report back.

isaac-chung · 2024-06-26T21:29:27Z

It has already been OOM-killed. Let's merge this and consider our options for MSMARCOv2.

example

6005251

KennethEnevoldsen approved these changes Jun 26, 2024

isaac-chung reviewed Jun 26, 2024

isaac-chung and others added 3 commits June 26, 2024 08:15

fix validation error

d7213dc

add retrieval task info

0e6fbc9

lint

6887512

orionw changed the title ~~WIP: Update numbers in retrieval statistics~~ Update numbers in retrieval statistics Jun 26, 2024

orionw marked this pull request as ready for review June 26, 2024 17:35

remove task_name key

e0f0e2a

isaac-chung merged commit 10c3fbf into main Jun 26, 2024
7 checks passed

isaac-chung deleted the update_retrieval_stats branch June 26, 2024 21:29

isaac-chung mentioned this pull request Jun 27, 2024

calculate_metadata_metrics on MSMARCOv2 goes OOM #992

Open
