
ACCORD: Closing the Commonsense Measurability Gap

Overview

We present ACCORD, a framework for generating Anti-faCtual COmmonsense Reasoning Disentanglement benchmarks. ACCORD tightly controls fine-grained counterfactual variants of commonsense reasoning tasks to enable detailed analysis of large language model (LLM) performance factors.

Our understanding of LLMs' commonsense reasoning abilities lags far behind our understanding of their formal reasoning abilities, such as in logic or math. Specifically, commonsense benchmarks are difficult to construct in a rigorously quantifiable manner. As a result, prior commonsense reasoning benchmarks and datasets are limited to one- or two-hop reasoning, or include an unknown (i.e., non-measurable) number of reasoning hops and/or distractors. As such, key players in AI have singled out commonsense as a critical new frontier.

ACCORD introduces formal elements to commonsense reasoning, and thus takes a significant step towards closing the commonsense measurability gap with respect to formal reasoning. In particular, ACCORD disentangles commonsense grounding and reasoning abilities in LLMs, while controlling for reasoning complexity (via increased reasoning hops), reasoning skills (since some skills may be harder than others to learn), and distractors (since real-world text is imperfect and often contains distracting elements). The ability to quantify and control these factors is ubiquitous in formal reasoning benchmarks, but overwhelmingly lacking in commonsense reasoning benchmarks.

ACCORD instances are grounded counterfactually (more accurately, anti-factually) to mitigate LLMs' ability to spuriously 'guess' correct answers, thereby shortcutting the intended reasoning task. In addition, ACCORD is uniquely designed to automatically scale its difficulty level in tandem with future LLM improvements by leveraging compositional scalability to generate future benchmarks of arbitrary reasoning complexity with minimal additional human effort. Arbitrary scalability via compositional construction is somewhat typical of formal reasoning tasks but lacking in commonsense reasoning.

ACCORD CSQA is an extension of the popular CommonsenseQA (CSQA) dataset using the ACCORD framework. Benchmarking state-of-the-art LLMs, including GPT-4o, Llama 3 70B, and Mixtral 8x22B Instruct, shows performance degrading to random chance with only moderate scaling of reasoning complexity, leaving substantial headroom for improvement. As such, we scale ACCORD CSQA from problem size 0 to problem size 5, providing the benchmark suite ACCORD CSQA 0-5. We release this benchmark suite both as a standalone benchmark and through a competitive leaderboard, both of which can be found here. The leaderboard serves as a hub for recording LLM performance on this benchmark suite.

This source code repository contains the implementation of ACCORD as applied to ACCORD CSQA. The steps below can be used to re-generate ACCORD CSQA 0-5 as well as for automatically generating more complex benchmarks in the future. Additional details on both ACCORD and ACCORD CSQA can be found in our paper.

Framework

The figure below demonstrates how the ACCORD framework applies to a QA dataset, such as CSQA. The top row consists of manual preprocessing of the chosen dataset; this work only needs to be done once. The bottom row demonstrates the fully automated steps based on this preprocessing. Step 1: Combinatorially generate all possible reasoning trees based on the chosen reasoning skills. Step 2: Pair each QA instance to all matching trees using a pairing template. Step 3: For each paired tree, find all n-hop reasoning paths based on validated skill reductions. Step 4: For each reasoning path, duplicate the tree for each answer choice in the QA instance, then anti-factually ground all the variables in the trees.

Overview of the ACCORD framework

Below, we detail how each of these steps can be reproduced from this code base.

Config Files and Commands

Each step below requires running one or more commands. Each command requires one or more config files. All config files can be found in configs/CSQA_and_ConceptNet/. The commands below implicitly load the correct config files, assuming you run each command from within the configs/CSQA_and_ConceptNet directory.

To run a command, clone this repository, install the dependencies with pip install -r requirements.txt, and point your PYTHONPATH to both accord/src and to wherever the dependencies are installed (or do the equivalent in your preferred Python environment manager).
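
For example, a minimal setup sketch assuming a virtual environment (the clone URL is taken from this repository; adapt the layout and paths to your own environment):

git clone https://github.com/francois-rd/accord.git
cd accord
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# With an active virtual environment, the installed dependencies are already
# importable; accord/src still needs to be added to the path.
export PYTHONPATH="$(pwd)/src:$PYTHONPATH"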

All config files are YAML files that derive from a config dataclass defined in the code base. The dataclass defines 'safe' defaults, which the YAML files may or may not override; when the two clash, the YAML value wins. Command-line values, in turn, override both defaults and YAML values, which is sometimes useful. All configs use OmegaConf, and so support variable interpolation.
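
Both mechanisms appear in the commands throughout this README; the two commands below (each explained in detail in its own section later) are shown here only to illustrate the config machinery:

cd configs/CSQA_and_ConceptNet

# Command-line override: the tree_size given here wins over both the YAML value
# and the dataclass default.
python ../../src/main.py generate.generic \
  tree_size=2 --filter-path filters/filter_2.yaml

# Variable interpolation: '${analysis_dir}' is resolved by OmegaConf when the
# config is loaded (the single quotes keep the shell from expanding it).
python ../../src/main.py analyze.basic.csqa.conceptnet \
  analysis_file='${analysis_dir}'/analysis_basic.jsonl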

All commands below necessarily use the resources.yaml and general.yaml configs, which are defined and explained in the code here. Each command may also require additional configs on an individual basis.

Preprocessing Existing Assets

Both ConceptNet and CSQA require some preprocessing. CSQA derives from ConceptNet (see the CSQA paper here). Specifically, CSQA builds on the relation types found in ConceptNet. ACCORD CSQA expands these relation types into full reasoning templates based on CSQA reasoning skills (see the paper for details). Some pairs of reasoning skills reduce to simpler assertions. Valid relation types and reductions are manually vetted as a preprocessing step, and the results can be found here and here, respectively.

NOTE: All preprocessing results are already in data/ConceptNet/ and data/CSQA/ so these steps can be skipped in practice. Instructions are provided here only for completeness.

1. Preprocessing ConceptNet

ConceptNet needs to be preprocessed from its raw form into a form more suitable for easy access with ACCORD. Download the raw data here and place it in data/ConceptNet/raw/ (the raw data is too big to sensibly store on GitHub). Then, run:

cd configs/CSQA_and_ConceptNet
python ../../src/main.py preprocess.conceptnet

Implicitly, this command pulls from conceptnet.yaml, which is defined here.

This extracts the subset of ConceptNet needed for ACCORD CSQA, reducing storage requirements from about 10GB to 16MB. The resulting files are stored in data/ConceptNet/preprocessed/.

2. Preprocessing CSQA

CSQA preprocessing occurs in four steps, three of which are automated. Implicitly, all three automated commands pull from csqa.yaml, which is defined here.

First, the ConceptNet relation of each CSQA instance must be inferred. Download the raw data of the CSQA development set here and place it in data/CSQA/raw/ (the raw data is too big to sensibly store on GitHub). Then, run:

cd configs/CSQA_and_ConceptNet
python ../../src/main.py preprocess.csqa.infer

Implicitly, this command also pulls from conceptnet.yaml, which is defined here.

Second, the CSQA instances are sub-sampled to balance the classes of the base instances from which ACCORD CSQA will derive. Run:

cd configs/CSQA_and_ConceptNet
python ../../src/main.py preprocess.csqa.sample

Third is the manual step: pairing templates (see the paper for details) must be hand-crafted for (at least some of) the sub-sampled CSQA instances. We have done this work already. The results can be found here.

Finally, the sub-sampling and inference steps are combined and converted to JSON objects that are used throughout ACCORD. Any CSQA instances that are not defined in both places are silently ignored. The results are stored in data/CSQA/converted/. Run:

cd configs/CSQA_and_ConceptNet
python ../../src/main.py preprocess.csqa.convert

This ends the preprocessing.

ACCORD CSQA Benchmark Generation

From the preprocessed data, ACCORD CSQA 0-5 can be generated entirely automatically. First, generic trees are combinatorially generated. These represent reasoning trees in their most abstract form. Second, each generic tree is combinatorially instantiated into one or more relational trees using registered relation types and validated reductions. Third, each relational tree is paired with every matching CSQA instance to form a so-called forest and then the variables of each tree in each forest are anti-factually grounded. Finally, individual grounded trees are grouped together based on similarity criteria.

All these steps are then repeated for each problem size (i.e., tree_size) from 2 to 5. Problem sizes 0 and 1 naturally fall out of the preprocessing steps above and don't require any supplemental generation step. To generate ACCORD CSQA 2-5, run (in Bash-like syntax):

cd configs/CSQA_and_ConceptNet
for tree_size in {2..5} ; do
  python ../../src/main.py generate.generic \
    tree_size=$tree_size --filter-path filters/filter_"$tree_size".yaml
  python ../../src/main.py generate.relational \
    tree_size=$tree_size --filter-path filters/filter_"$tree_size".yaml
  python ../../src/main.py generate.forest.csqa.conceptnet \
    tree_size=$tree_size --filter-path filters/filter_"$tree_size".yaml
  python ../../src/main.py generate.group.csqa.conceptnet \
    tree_size=$tree_size
done

Explicitly, the first three commands pull from a size-specific instance of filter.yaml, which is defined in code here. Since the combinatorial generation process is intractable, we probabilistically filter out candidate trees and forests to keep only a representative sub-sample. In general, our goal is to filter out the fewest samples possible while keeping both the runtime and final dataset size reasonable. As such, we filter exponentially more aggressively with increasing problem size, hence the size-specific filtering configs.

Implicitly, the generate.forest.csqa.conceptnet command also pulls from csqa.yaml, which is defined here, from conceptnet.yaml, which is defined here, and from beam_search.yaml and sorter.yaml, which are both defined here.

Implicitly, the generate.group.csqa.conceptnet command also pulls from csqa.yaml, and from beam_search.yaml and mapping_distance.yaml, which are both defined here.

NOTE: For a given problem size, each generation step is dependent on the previous. However, across problem sizes, the generation is entirely independent. As such, data parallelism can be achieved, in practice, by splitting up the problem sizes. Here, we have shown them all together for compactness.
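For example, a hedged sketch of size-parallel generation (assuming your machine has the memory and cores to run all problem sizes concurrently):

cd configs/CSQA_and_ConceptNet
# Launch each problem size in its own background subshell; the four steps within
# a size still run sequentially, since each depends on the previous one.
for tree_size in {2..5} ; do
  (
    python ../../src/main.py generate.generic \
      tree_size=$tree_size --filter-path filters/filter_"$tree_size".yaml
    python ../../src/main.py generate.relational \
      tree_size=$tree_size --filter-path filters/filter_"$tree_size".yaml
    python ../../src/main.py generate.forest.csqa.conceptnet \
      tree_size=$tree_size --filter-path filters/filter_"$tree_size".yaml
    python ../../src/main.py generate.group.csqa.conceptnet \
      tree_size=$tree_size
  ) &
done
wait  # Block until all problem sizes have finished.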

Reproducing Experimental Results

There are two main steps required to generate the experimental results found in the paper. First, we benchmark various LLMs against ACCORD CSQA 0-5. Second, we analyze those results and make human-legible figures from them.

1. Benchmarking LLMs

We benchmarked 10 LLMs in our paper. Each LLM has slightly different requirements. Each LLM's specific requirements are stored as command-line overrides to specific config values. Each can be found here, with file names matching the corresponding LLM.
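The benchmarking scripts below read each arguments file line by line (via mapfile) and splice its contents into the command, so each line should be a self-contained key=value override. As a purely hypothetical illustration, an arguments/name_of_llm.txt file might contain (these key names are invented for illustration; consult the actual files for the real overrides):

max_new_tokens=512
temperature=0.0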

As with generation, benchmarking LLMs is done separately for each problem size, and so also requires a different filter.yaml config, using the same principle as described there.

OpenAI Models

To benchmark any OpenAI models (we benchmarked gpt-4o-2024-05-13 and gpt-3.5-turbo-0125), run:

cd configs/CSQA_and_ConceptNet

# Choose an LLM to benchmark.
llm=name_of_llm  # E.g., gpt-4o-2024-05-13

# Load config overrides specific to the chosen LLM.
mapfile -t < arguments/"$llm".txt
llm_args="${MAPFILE[@]}"

# Tree size 0 (i.e., the baseline) needs special treatment.
python ../../src/main.py prompt.openai.csqa.conceptnet \
  llm=$llm $llm_args tree_size=0 \
  --filter-path filters/filter_0.yaml \
  --surfacer-path surfacer_tree_size_0.yaml \
  --openai-path openai_tree_size_0.yaml

# All other tree sizes use the same surfacer and openai configs.
for tree_size in {1..5} ; do
  python ../../src/main.py prompt.openai.csqa.conceptnet \
    llm=$llm $llm_args tree_size=$tree_size \
    --filter-path filters/filter_"$tree_size".yaml
done

Replace name_of_llm with a value that the OpenAI API will accept.

Implicitly, this command also pulls from csqa.yaml, which is defined here, from openai.yaml or openai_tree_size_0.yaml, which are variants of the same config defined here, from surfacer.yaml or surfacer_tree_size_0.yaml, which are variants of the same config defined here, and from snapshot.yaml, which is defined here.

Local Models Using VLLM

To benchmark local models (we benchmarked Llama-2-13b-chat-hf, Llama-2-70b-chat-hf, Meta-Llama-3-8B-Instruct, Meta-Llama-3-70B-Instruct, Mixtral-8x7B-Instruct-v0.1, and Mixtral-8x22B-Instruct-v0.1, all downloaded from the HuggingFace Model Hub) using a VLLM server, we use the OpenAI ChatCompletion API. To start local VLLM servers, install vector-inference, then run:

bash vector-inference/models/llama2/launch_server.sh -v 13b-chat -n 2
bash vector-inference/models/llama2/launch_server.sh -v 70b-chat -n 4

bash vector-inference/models/llama3/launch_server.sh -v 8B-Instruct -n 4
bash vector-inference/models/llama3/launch_server.sh -v 70B-Instruct -n 4

bash vector-inference/models/mixtral/launch_server.sh -v 8x7B-Instruct-v0.1 -n 4
bash vector-inference/models/mixtral/launch_server.sh -v 8x22B-Instruct-v0.1 -N 2 -n 4

After all VLLM servers are up and running, run the same commands as for the OpenAI models above. Replace name_of_llm with the path to the local model's top-level directory in your environment. You may need to tweak the path on the mapfile -t < arguments/"$llm".txt line as a result.
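For example (hedged; the exact mount point of the model weights is environment-specific):

# Point llm at the local model directory and load its overrides; tweak the
# arguments/ file name if it differs from the path basename.
llm=/path/to/model-weights/Meta-Llama-3-70B-Instruct
mapfile -t < arguments/Meta-Llama-3-70B-Instruct.txt
llm_args="${MAPFILE[@]}"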

Other Local Models

To benchmark any models not supported by vector-inference (in our case, gemma-7b-it and Mistral-7B-Instruct-v0.1), we used the HuggingFace Transformers Pipeline API. Run:

cd configs/CSQA_and_ConceptNet

# Choose an LLM to benchmark.
llm=name_of_llm  # E.g., gemma-7b-it

# Load config overrides specific to the chosen LLM.
mapfile -t < arguments/"$llm".txt
llm_args="${MAPFILE[@]}"

# Tree size 0 (i.e., the baseline) needs special treatment.
python ../../src/main.py prompt.transformers.csqa.conceptnet \
  llm=$llm $llm_args tree_size=0 \
  --filter-path filters/filter_0.yaml \
  --surfacer-path surfacer_tree_size_0.yaml \
  --transformers-path transformers_tree_size_0.yaml

# All other tree sizes use the same surfacer and transformers configs.
for tree_size in {1..5} ; do
  python ../../src/main.py prompt.transformers.csqa.conceptnet \
    llm=$llm $llm_args tree_size=$tree_size \
    --filter-path filters/filter_"$tree_size".yaml
done

Replace name_of_llm with a value that the Pipeline API will accept, based on the path to the local model's top-level directory in your environment. You may need to tweak the path on the mapfile -t < arguments/"$llm".txt line as a result.

Implicitly, this command also pulls from csqa.yaml, which is defined here, from transformers.yaml or transformers_tree_size_0.yaml, which are variants of the same config defined here, from surfacer.yaml or surfacer_tree_size_0.yaml, which are variants of the same config defined here, and from snapshot.yaml, which is defined here.

2. Analyzing LLM Results

We analyze the LLM benchmarking results generated above and collate them into human-legible figures. There are two types of analysis. First, 'basic' analysis simply computes LLM performance broken down by reasoning complexity and factuality. Second, 'interaction' analysis computes the interaction effect between reasoning hops and distractors, broken down by factuality.

NOTE: Some of the configs relating to the commands below have LLM names with path prefixes, such as model-weights/gemma-7b-it for the gemma-7b-it model. These paths reflect the local directory where models are stored in our environment and may need to be changed in yours.

Performing Analysis

To perform basic analysis, run:

cd configs/CSQA_and_ConceptNet
python ../../src/main.py analyze.basic.csqa.conceptnet \
  analysis_file='${analysis_dir}'/analysis_basic.jsonl

To perform interaction analysis, run:

cd configs/CSQA_and_ConceptNet
python ../../src/main.py analyze.interaction.csqa.conceptnet \
  analysis_file='${analysis_dir}'/analysis_interaction.jsonl

Implicitly, both commands pull from csqa.yaml, which is defined here, and from analysis_loader.yaml, which is defined here.

The analyze.basic.csqa.conceptnet command also pulls from basic_analysis.yaml, which is defined here. Note that TREE_SIZE is commented out in the bin_types field of that config file. This exactly reproduces the figures in the paper. It can be uncommented to include tree size as an analysis result, though the result is not qualitatively very different from the reasoning hops breakdown (hence its exclusion from the paper).

Making Figures

To make basic analysis figures, run:

cd configs/CSQA_and_ConceptNet
python ../../src/main.py analyze.basic.pretty \
  analysis_file='${analysis_dir}'/analysis_basic.jsonl

Implicitly, this command pulls from basic_analysis.yaml, which is defined here, and from basic_pretty.yaml, which is defined here.

To make interaction analysis figures, run:

cd configs/CSQA_and_ConceptNet
python ../../src/main.py analyze.interaction.pretty \
  analysis_file='${analysis_dir}'/analysis_interaction.jsonl

Implicitly, this command pulls from interaction_pretty.yaml, which is defined here.

The figures are modular and are meant to fit together, as seen in the paper. Hints for swapping between main paper figure generation and supplemental paper figure generation are given as comments in the config files.

Generating Official Benchmark Releases (i.e., "snapshots")

To generate the official release of ACCORD CSQA 0-5, we take "snapshots" of each problem size as they would be fed to an LLM (but using a so-called "dummy" LLM instead). Using a dummy LLM allows us to reuse the LLM prompting code essentially unchanged.

There are two official releases of ACCORD CSQA 0-5, one Small task size containing exactly one instance per reasoning hop per distractor per pairing (total size 2,864), and one Large task size containing a minimum of 10 instances per reasoning hop per distractor per pairing (total size 245,514). For most purposes, including the experimental results in the paper, ACCORD CSQA Small 0-5 is sufficient. As such, the filter.yaml configs in this repository are set to generate ACCORD CSQA Small 0-5 by default. To generate ACCORD CSQA Large 0-5, set default_num_llm_prompts=-1 in the code below. Run:

cd configs/CSQA_and_ConceptNet

# How many instances to keep per reasoning hop per distractor per pairing.
default_num_llm_prompts=1  # Set to -1 to keep ALL instances to generate ACCORD CSQA Large 0-5.

extension="_small"
if [ "$default_num_llm_prompts" = "-1" ]; then
  extension="_large"
fi

# Tree size 0 (i.e., the baseline) needs special treatment.
python ../../src/main.py prompt.dummy.csqa.conceptnet \
  llm=dummy snapshot::ignore=false tree_size=0 \
  default_num_llm_prompts=$default_num_llm_prompts \
  snapshot_dir='${result_dir}'/accord_csqa"$extension" \
  snapshots_file='${snapshot_dir}'/accord_csqa"$extension"_'${tree_size}'.jsonl \
  --filter-path filters/filter_0.yaml \
  --surfacer-path surfacer_tree_size_0.yaml

# All other tree sizes use the same surfacer config.
for tree_size in {1..5} ; do
  python ../../src/main.py prompt.dummy.csqa.conceptnet \
    llm=dummy snapshot::ignore=false tree_size=$tree_size \
    default_num_llm_prompts=$default_num_llm_prompts \
    snapshot_dir='${result_dir}'/accord_csqa"$extension" \
    snapshots_file='${snapshot_dir}'/accord_csqa"$extension"_'${tree_size}'.jsonl \
    --filter-path filters/filter_"$tree_size".yaml
done

Implicitly, this command also pulls from csqa.yaml, which is defined here, from dummy.yaml, which is defined here, from surfacer.yaml or surfacer_tree_size_0.yaml, which are variants of the same config defined here, and from snapshot.yaml, which is defined here.

NOTE: dummy.yaml is, in practice, ignored when generating snapshots, but can be used as a debugging tool. The value does NOT affect the content of the official benchmark releases.

Tips for Benchmarking Other LLMs

To benchmark LLMs other than the 10 that were part of the experiments in the paper, you essentially have two options. The first (and recommended) option is to download the official benchmark releases from the official leaderboard website here and then run LLMs against them using your own code. We also encourage submitting to the leaderboard to record your LLMs' performance against that of others. Instructions can be found on the leaderboard website. The second option is to clone this repository and implement a subclass of LLM as an interface between your LLM and the benchmarking process discussed above. Use the existing implementations of the OpenAI and Transformers models as a guide.

Random Samples from ACCORD CSQA 0-5

These are drawn from ACCORD CSQA Small (one per problem size from 0 to 5; shown in increasing order).

Instance ID: G50_0_E
Meta-data:    Reasoning Hops: 0    Distractors: 0    Problem Size: 0    Ground Truth Label: E
Instructions:
Answer the following multiple-choice question. Provide your answer in JSON format using the following schema: {"answer": <label>} where <label> is exactly one of: "A", "B", "C", "D", or "E". Do not output anything else.
Question:
He was on trial for obstructing justice, during which he made a questionable comment and was also found guilty of what?
A: prosecution    B: getting hurt    C: sweat    D: steam    E: committing perjury
Answer:
Instance ID: G68_1_B
Meta-data:    Reasoning Hops: 1    Distractors: 0    Problem Size: 1    Ground Truth Label: B
Instructions:
You will be provided with statements relating to a multiple-choice question. The contents of the statements may disagree with your prior knowledge of the world. That is ok. Your task is to provide the most appropriate answer to the multiple-choice question based on the reasoning presented in the statements. Provide your answer in JSON format using the following schema: {"answer": <label>} where <label> is exactly one of: "A", "B", "C", "D", or "E". Do not output anything else.
Statements:
- Suppose that [sitting_quietly] is not a part of [fall asleep]
- Suppose that [sitting_quietly] is a part of [meditate]
- Suppose that [sitting_quietly] is not a part of [reading]
- Suppose that [sitting_quietly] is not a part of [bunk]
- Suppose that [sitting_quietly] is not a part of [think]
Question:
What is someone doing if he or she is sitting quietly and his or her eyes are moving?
A: reading    B: meditate    C: fall asleep    D: bunk    E: think
Answer:
Instance ID: G16155_2_C
Meta-data:    Reasoning Hops: 2    Distractors: 0    Problem Size: 2    Ground Truth Label: C
Instructions:
You will be provided with statements relating to a multiple-choice question. The contents of the statements may disagree with your prior knowledge of the world. That is ok. Your task is to provide the most appropriate answer to the multiple-choice question based on the reasoning presented in the statements. Provide your answer in JSON format using the following schema: {"answer": <label>} where <label> is exactly one of: "A", "B", "C", "D", or "E". Do not output anything else.
Statements:
- Suppose that [serious] is a type of [toilet training product]
- Suppose that [longplay] is a type of [pinniped mammal]
- Suppose that [mammalogy] is a type of [boring activity]
- Suppose that [coccid insect] is a type of [boring activity]
- Suppose that [musical] is a type of [entree]
- Suppose that [pinniped mammal] is a type of [boring activity]
- Suppose that [toilet training product] is a type of [boring activity]
- Suppose that [eat cake] is a type of [coccid insect]
- Suppose that [doing nothing] is a type of [mammalogy]
- Suppose that [entree] is not a type of [boring activity]
Question:
Sarah didn't like to play but she didn't want to be sedentary and bored, either, so she took up what?
A: serious    B: longplay    C: musical    D: eat cake    E: doing nothing
Answer:
Instance ID: G7186_3_C
Meta-data:    Reasoning Hops: 3    Distractors: 0    Problem Size: 3    Ground Truth Label: C
Instructions:
You will be provided with statements relating to a multiple-choice question. The contents of the statements may disagree with your prior knowledge of the world. That is ok. Your task is to provide the most appropriate answer to the multiple-choice question based on the reasoning presented in the statements. Provide your answer in JSON format using the following schema: {"answer": <label>} where <label> is exactly one of: "A", "B", "C", "D", or "E". Do not output anything else.
Statements:
- Suppose that [each country] appears near [breakfast cereal]
- Suppose that [a steering wheel] does appear near [each country]
- Suppose that [vase] appears near [aetna]
- Suppose that [jumbo jet] appears near [preserved foods]
- Suppose that [parcel] appears near [train]
- Suppose that [display] appears near [parcel]
- Suppose that [a steering wheel] does not appear near [motels]
- Suppose that [a steering wheel] does not appear near [vase]
- Suppose that [preserved foods] appears near [drawer]
- Suppose that [traffic signs] appears near [firearm]
- Suppose that [aetna] appears near [keep cloesd]
- Suppose that [motels] appears near [traffic signs]
- Suppose that [breakfast cereal] appears near [ignition switch]
- Suppose that [a steering wheel] does not appear near [jumbo jet]
- Suppose that [a steering wheel] does not appear near [display]
Question:
The lock kept the steering wheel from moving, but the thief still took his chances and began to work on the what?
A: keep cloesd    B: train    C: ignition switch    D: drawer    E: firearm
Answer:
Instance ID: G13526_4_C
Meta-data:    Reasoning Hops: 4    Distractors: 0    Problem Size: 4    Ground Truth Label: C
Instructions:
You will be provided with statements relating to a multiple-choice question. The contents of the statements may disagree with your prior knowledge of the world. That is ok. Your task is to provide the most appropriate answer to the multiple-choice question based on the reasoning presented in the statements. Provide your answer in JSON format using the following schema: {"answer": <label>} where <label> is exactly one of: "A", "B", "C", "D", or "E". Do not output anything else.
Statements:
- Suppose that [kissing too long] does not cause [feet touch ground]
- Suppose that only [change] causes [charge]
- Suppose that only [understanding better] causes [dry pet food]
- Suppose that [strong feelings] is a type of [charge]
- Suppose that [arousal] is a type of [morning glory]
- Suppose that only [working and getting paid for] causes [understanding better]
- Suppose that only [feet touch ground] causes [lying]
- Suppose that [kissing too long] does cause [working and getting paid for]
- Suppose that [kissing too long] does not cause [hurting]
- Suppose that only [returning to work] causes [kingdom]
- Suppose that only [hurting] causes [driving]
- Suppose that [herpes] is a type of [kingdom]
- Suppose that [kissing too long] does not cause [leaving ring]
- Suppose that [excitement] is a type of [character set]
- Suppose that only [driving] causes [character set]
- Suppose that [kissing too long] does not cause [use brain]
- Suppose that [shortness of breath] is a type of [dry pet food]
- Suppose that only [use brain] causes [change]
- Suppose that only [leaving ring] causes [returning to work]
- Suppose that only [lying] causes [morning glory]
Question:
What happens if someone kisses too long?
A: strong feelings    B: herpes    C: shortness of breath    D: excitement    E: arousal
Answer:
Instance ID: G31713_5_D
Meta-data:    Reasoning Hops: 1    Distractors: 4    Problem Size: 5    Ground Truth Label: D
Instructions:
You will be provided with statements relating to a multiple-choice question. The contents of the statements may disagree with your prior knowledge of the world. That is ok. Your task is to provide the most appropriate answer to the multiple-choice question based on the reasoning presented in the statements. Provide your answer in JSON format using the following schema: {"answer": <label>} where <label> is exactly one of: "A", "B", "C", "D", or "E". Do not output anything else.
Statements:
- Suppose that [not losing weight] does not cause [beauty]
- Suppose that [protecting passport] is a type of [catholicism]
- Suppose that [loose skin] is used for [privacy]
- Suppose that [placing basketball] causes [not losing weight]
- Suppose that [privacy] is a type of [restraint]
- Suppose that [healthier] is used for [protecting passport]
- Suppose that [death] is used for [writing to friend or business]
- Suppose that [not losing weight] does not cause [loose skin]
- Suppose that [orient] causes [not losing weight]
- Suppose that [text string occurrence] is used for [achieving goal]
- Suppose that [catholicism] is used for [cook oatmeal]
- Suppose that [depression] causes [not losing weight]
- Suppose that [not losing weight] does not cause [healthier]
- Suppose that [not losing weight] does not cause [miss universe]
- Suppose that [writing to friend or business] is a type of [text string occurrence]
- Suppose that [using water colors] is a type of [vendor]
- Suppose that [invite people over] is a type of [sputnik]
- Suppose that [beauty] is used for [invite people over]
- Suppose that [watering lawn] causes [not losing weight]
- Suppose that [familiar sound] causes [not losing weight]
- Suppose that [not losing weight] does cause [death]
- Suppose that [sputnik] is used for [getting up in morning]
- Suppose that [miss universe] is used for [using water colors]
- Suppose that [vendor] is used for [transporting cargo]
- Suppose that [restraint] is used for [avoid sunburn]
Question:
What might happen if someone is not losing weight?
A: loose skin    B: beauty    C: miss universe    D: death    E: healthier
Answer: