🤖 AI Summary
This study challenges the validity of multiple-choice question answering (MCQA) as a proxy for evaluating the reasoning capabilities of large language models (LLMs), particularly state-of-the-art reasoning models. Method: a systematic analysis across 15 MCQA benchmarks (e.g., MMLU, HLE) and 25 LLMs, varying five question-presentation formats (including whether options were shown at all, whether "none of the above" sometimes replaced the correct answer, and whether chain-of-thought (CoT) reasoning was permitted before and/or after the options appeared) to isolate how presentation biases affect measured performance. Results: models exploit option content for "test-wise" inference rather than genuine reasoning; large models that are allowed to reason after seeing the options substantially outperform their own free-text performance, exposing a serious evaluation bias in standard MCQA. By contrast, MCQA remains a reasonable proxy when reasoning is elicited only before option exposure, so the timing of CoT prompting relative to option exposure is critical to evaluation validity. These findings undermine MCQA's suitability as a downstream-task proxy and motivate de-biased, presentation-controlled evaluation paradigms.
📝 Abstract
When evaluating Large Language Models (LLMs) in question-answering domains, it is common to ask the model to choose among a fixed set of choices (so-called multiple-choice question answering, or MCQA). Although downstream tasks of interest typically do not provide systems with explicit options among which to choose, this approach is widely used because it makes automatic grading straightforward and has tended to produce challenging benchmarks that correlate sufficiently well with downstream performance. This paper investigates the extent to which this trend continues to hold for state-of-the-art reasoning models, describing a systematic evaluation of 15 different question-answering benchmarks (e.g., MMLU, HLE) and 25 different LLMs (ranging from small models such as Qwen 7B to relatively large models such as Llama 70B). For each model-benchmark pair, we considered 5 ways of presenting the model with questions, including variations on whether multiple choices were offered to the model at all; whether "none of the above" sometimes replaced the right answer; and whether the model was permitted to perform chain-of-thought reasoning before and/or after the choices were presented. MCQA remained a good proxy for the downstream performance of models as long as they were allowed to perform chain-of-thought reasoning only before being presented with the options among which they had to select. On the other hand, large models that were able to reason after being given a set of options tended to significantly outperform their free-text performance by exploiting the information in the options. We conclude that MCQA is no longer a good proxy for assessing the downstream performance of state-of-the-art models, and we offer practical guidelines for designing more robust, bias-resistant benchmarks that better reflect LLMs' genuine reasoning capabilities.
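The presentation conditions described in the abstract can be sketched as prompt templates. This is a minimal illustration only: the condition names, prompt wording, and turn structure below are assumptions, not the paper's actual evaluation harness.

```python
def build_prompts(stem, options, condition):
    """Return the sequence of user turns for one presentation condition.

    Illustrative condition names (not the paper's):
      free_text          -- no options shown; model answers in free form
      mcqa_standard      -- options shown with the question, answer directly
      cot_after_options  -- options shown, then the model reasons before answering
      cot_before_options -- model reasons on the bare stem first; the options
                            only arrive in a second turn
      none_of_the_above  -- same layout as mcqa_standard, but the caller has
                            already replaced the gold option (see helper below)
    """
    lettered = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    if condition == "free_text":
        return [f"{stem}\nAnswer in your own words."]
    if condition in ("mcqa_standard", "none_of_the_above"):
        return [f"{stem}\n{lettered}\nAnswer with a single letter."]
    if condition == "cot_after_options":
        return [f"{stem}\n{lettered}\nThink step by step, then give a single letter."]
    if condition == "cot_before_options":
        # Two turns: reasoning is elicited before the model ever sees the options.
        return [
            f"{stem}\nThink step by step about the answer.",
            f"Now choose from the options:\n{lettered}\nAnswer with a single letter.",
        ]
    raise ValueError(f"unknown condition: {condition}")


def replace_gold_with_nota(options, gold_index):
    """Swap the correct option for 'None of the above' (for none_of_the_above)."""
    out = list(options)
    out[gold_index] = "None of the above"
    return out
```

In this sketch, the `cot_before_options` condition is the one the abstract identifies as preserving MCQA's value as a proxy: the model commits to its reasoning on the bare question stem, so it cannot mine the options for hints.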