Fix regexp parsing for bbh_cot_fewshot #2013

Open
wants to merge 1 commit into
base: main

arkapal3

Hello,

The regexp used to parse answers for CoT few-shot on BBH is too strict. With this fix applied, the score for Qwen2-72B-Instruct improves from 41.4% to 80.4% (num_fewshot = 3). Full results are below; on some BBH subtasks, such as boolean_expressions, the score rises from 0% to 96.4%.

The issue arises because Qwen2 emits an extra space at the end of its generation, like so: `So the answer is (B). `. The existing regexp parses this as `(B).`, which fails the exact match against the expected answer `(B)`.
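To make the failure mode concrete, here is a minimal sketch. It assumes the existing filter behaves like a pattern that captures everything after "the answer is " except the final character of the generation; `STRICT` is a stand-in for that behaviour, and `RELAXED` is a hypothetical tolerant variant used only for illustration, not necessarily the exact change in this PR.

```python
import re

# Assumed stand-in for the existing filter: capture everything after
# "the answer is " except the last character of the generation.
STRICT = re.compile(r"(?<=the answer is )(.*)(?=.)")

# Hypothetical relaxed variant: stop before an optional final period
# followed by any trailing whitespace.
RELAXED = re.compile(r"(?<=the answer is )(.*?)(?=\.?\s*$)")


def extract(pattern, generation):
    """Return the captured answer, or None if the pattern does not match."""
    match = pattern.search(generation)
    return match.group(1) if match else None


clean = "So the answer is (B)."
trailing_space = "So the answer is (B). "

print(repr(extract(STRICT, clean)))            # '(B)'
print(repr(extract(STRICT, trailing_space)))   # '(B).'  -> exact match fails
print(repr(extract(RELAXED, clean)))           # '(B)'
print(repr(extract(RELAXED, trailing_space)))  # '(B)'
```

With the strict pattern, the trailing space takes the place of the character that the lookahead is meant to drop, so the period stays in the captured answer.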

Full comparison before and after fix below. Before:


|                          Tasks                           |Version|  Filter  |n-shot|  Metric   |   |Value |   |Stderr|
|----------------------------------------------------------|-------|----------|-----:|-----------|---|-----:|---|-----:|
|bbh                                                       |N/A    |get-answer|     3|exact_match|↑  |0.4138|±  |0.0047|
| - bbh_cot_fewshot_boolean_expressions                    |      2|get-answer|     3|exact_match|↑  |0.0000|±  |0.0000|
| - bbh_cot_fewshot_causal_judgement                       |      2|get-answer|     3|exact_match|↑  |0.1444|±  |0.0258|
| - bbh_cot_fewshot_date_understanding                     |      2|get-answer|     3|exact_match|↑  |0.6000|±  |0.0310|
| - bbh_cot_fewshot_disambiguation_qa                      |      2|get-answer|     3|exact_match|↑  |0.4160|±  |0.0312|
| - bbh_cot_fewshot_dyck_languages                         |      2|get-answer|     3|exact_match|↑  |0.0720|±  |0.0164|
| - bbh_cot_fewshot_formal_fallacies                       |      2|get-answer|     3|exact_match|↑  |0.0280|±  |0.0105|
| - bbh_cot_fewshot_geometric_shapes                       |      2|get-answer|     3|exact_match|↑  |0.2240|±  |0.0264|
| - bbh_cot_fewshot_hyperbaton                             |      2|get-answer|     3|exact_match|↑  |0.2920|±  |0.0288|
| - bbh_cot_fewshot_logical_deduction_five_objects         |      2|get-answer|     3|exact_match|↑  |0.6560|±  |0.0301|
| - bbh_cot_fewshot_logical_deduction_seven_objects        |      2|get-answer|     3|exact_match|↑  |0.4520|±  |0.0315|
| - bbh_cot_fewshot_logical_deduction_three_objects        |      2|get-answer|     3|exact_match|↑  |0.8480|±  |0.0228|
| - bbh_cot_fewshot_movie_recommendation                   |      2|get-answer|     3|exact_match|↑  |0.4160|±  |0.0312|
| - bbh_cot_fewshot_multistep_arithmetic_two               |      2|get-answer|     3|exact_match|↑  |0.0000|±  |0.0000|
| - bbh_cot_fewshot_navigate                               |      2|get-answer|     3|exact_match|↑  |0.7440|±  |0.0277|
| - bbh_cot_fewshot_object_counting                        |      2|get-answer|     3|exact_match|↑  |0.0240|±  |0.0097|
| - bbh_cot_fewshot_penguins_in_a_table                    |      2|get-answer|     3|exact_match|↑  |0.5068|±  |0.0415|
| - bbh_cot_fewshot_reasoning_about_colored_objects        |      2|get-answer|     3|exact_match|↑  |0.7280|±  |0.0282|
| - bbh_cot_fewshot_ruin_names                             |      2|get-answer|     3|exact_match|↑  |0.7880|±  |0.0259|
| - bbh_cot_fewshot_salient_translation_error_detection    |      2|get-answer|     3|exact_match|↑  |0.4160|±  |0.0312|
| - bbh_cot_fewshot_snarks                                 |      2|get-answer|     3|exact_match|↑  |0.8090|±  |0.0295|
| - bbh_cot_fewshot_sports_understanding                   |      2|get-answer|     3|exact_match|↑  |0.0000|±  |0.0000|
| - bbh_cot_fewshot_temporal_sequences                     |      2|get-answer|     3|exact_match|↑  |0.5840|±  |0.0312|
| - bbh_cot_fewshot_tracking_shuffled_objects_five_objects |      2|get-answer|     3|exact_match|↑  |0.8760|±  |0.0209|
| - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects|      2|get-answer|     3|exact_match|↑  |0.6560|±  |0.0301|
| - bbh_cot_fewshot_tracking_shuffled_objects_three_objects|      2|get-answer|     3|exact_match|↑  |0.8720|±  |0.0212|
| - bbh_cot_fewshot_web_of_lies                            |      2|get-answer|     3|exact_match|↑  |0.0000|±  |0.0000|
| - bbh_cot_fewshot_word_sorting                           |      2|get-answer|     3|exact_match|↑  |0.1040|±  |0.0193|

|Groups|Version|  Filter  |n-shot|  Metric   |   |Value |   |Stderr|
|------|-------|----------|-----:|-----------|---|-----:|---|-----:|
|bbh   |N/A    |get-answer|     3|exact_match|↑  |0.4138|±  |0.0047|

After:

|                          Tasks                           |Version|  Filter  |n-shot|  Metric   |   |Value |   |Stderr|       
|----------------------------------------------------------|-------|----------|-----:|-----------|---|-----:|---|-----:|       
|bbh                                                       |N/A    |get-answer|     3|exact_match|↑  |0.8036|±  |0.0044|       
| - bbh_cot_fewshot_boolean_expressions                    |      2|get-answer|     3|exact_match|↑  |0.9640|±  |0.0118|       
| - bbh_cot_fewshot_causal_judgement                       |      2|get-answer|     3|exact_match|↑  |0.6684|±  |0.0345|       
| - bbh_cot_fewshot_date_understanding                     |      2|get-answer|     3|exact_match|↑  |0.8000|±  |0.0253|       
| - bbh_cot_fewshot_disambiguation_qa                      |      2|get-answer|     3|exact_match|↑  |0.8360|±  |0.0235|       
| - bbh_cot_fewshot_dyck_languages                         |      2|get-answer|     3|exact_match|↑  |0.3040|±  |0.0292|       
| - bbh_cot_fewshot_formal_fallacies                       |      2|get-answer|     3|exact_match|↑  |0.7480|±  |0.0275|       
| - bbh_cot_fewshot_geometric_shapes                       |      2|get-answer|     3|exact_match|↑  |0.4960|±  |0.0317|       
| - bbh_cot_fewshot_hyperbaton                             |      2|get-answer|     3|exact_match|↑  |0.9440|±  |0.0146|       
| - bbh_cot_fewshot_logical_deduction_five_objects         |      2|get-answer|     3|exact_match|↑  |0.6800|±  |0.0296|       
| - bbh_cot_fewshot_logical_deduction_seven_objects        |      2|get-answer|     3|exact_match|↑  |0.4720|±  |0.0316|       
| - bbh_cot_fewshot_logical_deduction_three_objects        |      2|get-answer|     3|exact_match|↑  |0.9200|±  |0.0172|       
| - bbh_cot_fewshot_movie_recommendation                   |      2|get-answer|     3|exact_match|↑  |0.7800|±  |0.0263|       
| - bbh_cot_fewshot_multistep_arithmetic_two               |      2|get-answer|     3|exact_match|↑  |0.9760|±  |0.0097|       
| - bbh_cot_fewshot_navigate                               |      2|get-answer|     3|exact_match|↑  |0.9520|±  |0.0135|       
| - bbh_cot_fewshot_object_counting                        |      2|get-answer|     3|exact_match|↑  |0.9480|±  |0.0141|       
| - bbh_cot_fewshot_penguins_in_a_table                    |      2|get-answer|     3|exact_match|↑  |0.5753|±  |0.0410|       
| - bbh_cot_fewshot_reasoning_about_colored_objects        |      2|get-answer|     3|exact_match|↑  |0.8120|±  |0.0248|       
| - bbh_cot_fewshot_ruin_names                             |      2|get-answer|     3|exact_match|↑  |0.8760|±  |0.0209|       
| - bbh_cot_fewshot_salient_translation_error_detection    |      2|get-answer|     3|exact_match|↑  |0.5880|±  |0.0312|       
| - bbh_cot_fewshot_snarks                                 |      2|get-answer|     3|exact_match|↑  |0.8764|±  |0.0247|       
| - bbh_cot_fewshot_sports_understanding                   |      2|get-answer|     3|exact_match|↑  |0.9080|±  |0.0183|       
| - bbh_cot_fewshot_temporal_sequences                     |      2|get-answer|     3|exact_match|↑  |0.9960|±  |0.0040|       
| - bbh_cot_fewshot_tracking_shuffled_objects_five_objects |      2|get-answer|     3|exact_match|↑  |0.9160|±  |0.0176|       
| - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects|      2|get-answer|     3|exact_match|↑  |0.9400|±  |0.0151|       
| - bbh_cot_fewshot_tracking_shuffled_objects_three_objects|      2|get-answer|     3|exact_match|↑  |0.9440|±  |0.0146|
| - bbh_cot_fewshot_web_of_lies                            |      2|get-answer|     3|exact_match|↑  |1.0000|±  |0.0000|
| - bbh_cot_fewshot_word_sorting                           |      2|get-answer|     3|exact_match|↑  |0.6680|±  |0.0298|

|Groups|Version|  Filter  |n-shot|  Metric   |   |Value |   |Stderr|
|------|-------|----------|-----:|-----------|---|-----:|---|-----:|
|bbh   |N/A    |get-answer|     3|exact_match|↑  |0.8036|±  |0.0044|


CLAassistant commented Jun 24, 2024

CLA assistant check
All committers have signed the CLA.
