Fix regexp parsing for bbh_cot_fewshot #2013

arkapal3 · 2024-06-24T10:47:42Z

Hello,

The regexp used to parse answers for CoT fewshot on BBH is too strict. In particular, for Qwen2-72B-Instruct the aggregate score improves from 41.4% to 80.4% with this fix applied (num_fewshot = 3). Full results are below; on some BBH subtasks, such as boolean_expressions, the score increases from 0% to 96.4%.

The issue arises because Qwen2 produces an extra space at the end of its generation, like so: `So the answer is (B). ` (note the trailing space). The existing regexp parses this as `(B).`, which fails the exact match against the expected answer `(B)`.
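To make the failure mode concrete, here is a minimal sketch in Python. Both patterns are assumptions chosen for illustration and may differ from what the task config actually contains: `STRICT` ends with a lookahead that accepts any character, so a trailing space lets the period leak into the capture, while `LENIENT` requires a literal period after the answer. The helper `extract_answer` is hypothetical as well.

```python
import re

# Assumed patterns for illustration only; the task config may define them differently.
STRICT = re.compile(r"(?<=the answer is )(.*)(?=.)")    # lookahead accepts ANY character
LENIENT = re.compile(r"(?<=the answer is )(.*)(?=\.)")  # lookahead requires a literal period


def extract_answer(pattern, generation):
    """Return the captured answer, or None if the pattern does not match."""
    match = pattern.search(generation)
    return match.group(1) if match else None


clean = "So the answer is (B)."
trailing = "So the answer is (B). "  # extra space after the period

print(repr(extract_answer(STRICT, clean)))      # '(B)'
print(repr(extract_answer(STRICT, trailing)))   # '(B).'  -> exact match against '(B)' fails
print(repr(extract_answer(LENIENT, trailing)))  # '(B)'
```

With the greedy capture, the any-character lookahead only needs one character left over, and the trailing space satisfies it. Requiring a literal period forces the capture to stop before the final period whether or not whitespace follows.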

Full comparison before and after fix below. Before:


|                          Tasks                           |Version|  Filter  |n-shot|  Metric   |   |Value |   |Stderr|
|----------------------------------------------------------|-------|----------|-----:|-----------|---|-----:|---|-----:|
|bbh                                                       |N/A    |get-answer|     3|exact_match|↑  |0.4138|±  |0.0047|
| - bbh_cot_fewshot_boolean_expressions                    |      2|get-answer|     3|exact_match|↑  |0.0000|±  |0.0000|
| - bbh_cot_fewshot_causal_judgement                       |      2|get-answer|     3|exact_match|↑  |0.1444|±  |0.0258|
| - bbh_cot_fewshot_date_understanding                     |      2|get-answer|     3|exact_match|↑  |0.6000|±  |0.0310|
| - bbh_cot_fewshot_disambiguation_qa                      |      2|get-answer|     3|exact_match|↑  |0.4160|±  |0.0312|
| - bbh_cot_fewshot_dyck_languages                         |      2|get-answer|     3|exact_match|↑  |0.0720|±  |0.0164|
| - bbh_cot_fewshot_formal_fallacies                       |      2|get-answer|     3|exact_match|↑  |0.0280|±  |0.0105|
| - bbh_cot_fewshot_geometric_shapes                       |      2|get-answer|     3|exact_match|↑  |0.2240|±  |0.0264|
| - bbh_cot_fewshot_hyperbaton                             |      2|get-answer|     3|exact_match|↑  |0.2920|±  |0.0288|
| - bbh_cot_fewshot_logical_deduction_five_objects         |      2|get-answer|     3|exact_match|↑  |0.6560|±  |0.0301|
| - bbh_cot_fewshot_logical_deduction_seven_objects        |      2|get-answer|     3|exact_match|↑  |0.4520|±  |0.0315|
| - bbh_cot_fewshot_logical_deduction_three_objects        |      2|get-answer|     3|exact_match|↑  |0.8480|±  |0.0228|
| - bbh_cot_fewshot_movie_recommendation                   |      2|get-answer|     3|exact_match|↑  |0.4160|±  |0.0312|
| - bbh_cot_fewshot_multistep_arithmetic_two               |      2|get-answer|     3|exact_match|↑  |0.0000|±  |0.0000|
| - bbh_cot_fewshot_navigate                               |      2|get-answer|     3|exact_match|↑  |0.7440|±  |0.0277|
| - bbh_cot_fewshot_object_counting                        |      2|get-answer|     3|exact_match|↑  |0.0240|±  |0.0097|
| - bbh_cot_fewshot_penguins_in_a_table                    |      2|get-answer|     3|exact_match|↑  |0.5068|±  |0.0415|
| - bbh_cot_fewshot_reasoning_about_colored_objects        |      2|get-answer|     3|exact_match|↑  |0.7280|±  |0.0282|
| - bbh_cot_fewshot_ruin_names                             |      2|get-answer|     3|exact_match|↑  |0.7880|±  |0.0259|
| - bbh_cot_fewshot_salient_translation_error_detection    |      2|get-answer|     3|exact_match|↑  |0.4160|±  |0.0312|
| - bbh_cot_fewshot_snarks                                 |      2|get-answer|     3|exact_match|↑  |0.8090|±  |0.0295|
| - bbh_cot_fewshot_sports_understanding                   |      2|get-answer|     3|exact_match|↑  |0.0000|±  |0.0000|
| - bbh_cot_fewshot_temporal_sequences                     |      2|get-answer|     3|exact_match|↑  |0.5840|±  |0.0312|
| - bbh_cot_fewshot_tracking_shuffled_objects_five_objects |      2|get-answer|     3|exact_match|↑  |0.8760|±  |0.0209|
| - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects|      2|get-answer|     3|exact_match|↑  |0.6560|±  |0.0301|
| - bbh_cot_fewshot_tracking_shuffled_objects_three_objects|      2|get-answer|     3|exact_match|↑  |0.8720|±  |0.0212|
| - bbh_cot_fewshot_web_of_lies                            |      2|get-answer|     3|exact_match|↑  |0.0000|±  |0.0000|
| - bbh_cot_fewshot_word_sorting                           |      2|get-answer|     3|exact_match|↑  |0.1040|±  |0.0193|

|Groups|Version|  Filter  |n-shot|  Metric   |   |Value |   |Stderr|
|------|-------|----------|-----:|-----------|---|-----:|---|-----:|
|bbh   |N/A    |get-answer|     3|exact_match|↑  |0.4138|±  |0.0047|

After:

|                          Tasks                           |Version|  Filter  |n-shot|  Metric   |   |Value |   |Stderr|       
|----------------------------------------------------------|-------|----------|-----:|-----------|---|-----:|---|-----:|       
|bbh                                                       |N/A    |get-answer|     3|exact_match|↑  |0.8036|±  |0.0044|       
| - bbh_cot_fewshot_boolean_expressions                    |      2|get-answer|     3|exact_match|↑  |0.9640|±  |0.0118|       
| - bbh_cot_fewshot_causal_judgement                       |      2|get-answer|     3|exact_match|↑  |0.6684|±  |0.0345|       
| - bbh_cot_fewshot_date_understanding                     |      2|get-answer|     3|exact_match|↑  |0.8000|±  |0.0253|       
| - bbh_cot_fewshot_disambiguation_qa                      |      2|get-answer|     3|exact_match|↑  |0.8360|±  |0.0235|       
| - bbh_cot_fewshot_dyck_languages                         |      2|get-answer|     3|exact_match|↑  |0.3040|±  |0.0292|       
| - bbh_cot_fewshot_formal_fallacies                       |      2|get-answer|     3|exact_match|↑  |0.7480|±  |0.0275|       
| - bbh_cot_fewshot_geometric_shapes                       |      2|get-answer|     3|exact_match|↑  |0.4960|±  |0.0317|       
| - bbh_cot_fewshot_hyperbaton                             |      2|get-answer|     3|exact_match|↑  |0.9440|±  |0.0146|       
| - bbh_cot_fewshot_logical_deduction_five_objects         |      2|get-answer|     3|exact_match|↑  |0.6800|±  |0.0296|       
| - bbh_cot_fewshot_logical_deduction_seven_objects        |      2|get-answer|     3|exact_match|↑  |0.4720|±  |0.0316|       
| - bbh_cot_fewshot_logical_deduction_three_objects        |      2|get-answer|     3|exact_match|↑  |0.9200|±  |0.0172|       
| - bbh_cot_fewshot_movie_recommendation                   |      2|get-answer|     3|exact_match|↑  |0.7800|±  |0.0263|       
| - bbh_cot_fewshot_multistep_arithmetic_two               |      2|get-answer|     3|exact_match|↑  |0.9760|±  |0.0097|       
| - bbh_cot_fewshot_navigate                               |      2|get-answer|     3|exact_match|↑  |0.9520|±  |0.0135|       
| - bbh_cot_fewshot_object_counting                        |      2|get-answer|     3|exact_match|↑  |0.9480|±  |0.0141|       
| - bbh_cot_fewshot_penguins_in_a_table                    |      2|get-answer|     3|exact_match|↑  |0.5753|±  |0.0410|       
| - bbh_cot_fewshot_reasoning_about_colored_objects        |      2|get-answer|     3|exact_match|↑  |0.8120|±  |0.0248|       
| - bbh_cot_fewshot_ruin_names                             |      2|get-answer|     3|exact_match|↑  |0.8760|±  |0.0209|       
| - bbh_cot_fewshot_salient_translation_error_detection    |      2|get-answer|     3|exact_match|↑  |0.5880|±  |0.0312|       
| - bbh_cot_fewshot_snarks                                 |      2|get-answer|     3|exact_match|↑  |0.8764|±  |0.0247|       
| - bbh_cot_fewshot_sports_understanding                   |      2|get-answer|     3|exact_match|↑  |0.9080|±  |0.0183|       
| - bbh_cot_fewshot_temporal_sequences                     |      2|get-answer|     3|exact_match|↑  |0.9960|±  |0.0040|       
| - bbh_cot_fewshot_tracking_shuffled_objects_five_objects |      2|get-answer|     3|exact_match|↑  |0.9160|±  |0.0176|       
| - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects|      2|get-answer|     3|exact_match|↑  |0.9400|±  |0.0151|       
| - bbh_cot_fewshot_tracking_shuffled_objects_three_objects|      2|get-answer|     3|exact_match|↑  |0.9440|±  |0.0146|
| - bbh_cot_fewshot_web_of_lies                            |      2|get-answer|     3|exact_match|↑  |1.0000|±  |0.0000|
| - bbh_cot_fewshot_word_sorting                           |      2|get-answer|     3|exact_match|↑  |0.6680|±  |0.0298|

|Groups|Version|  Filter  |n-shot|  Metric   |   |Value |   |Stderr|
|------|-------|----------|-----:|-----------|---|-----:|---|-----:|
|bbh   |N/A    |get-answer|     3|exact_match|↑  |0.8036|±  |0.0044|

CLAassistant · 2024-06-24T10:47:49Z

All committers have signed the CLA.

Fix regexp parsing for bbh_cot_fewshot.

3e75c5e

arkapal3 requested review from haileyschoelkopf and lintangsutawika as code owners June 24, 2024 10:47
