🤖 AI Summary
It remains unclear whether large language models (LLMs) rely on genuine reasoning or exploit superficial cues, such as the answer options alone, in multiple-choice question answering (MCQA).
Method: We propose a faithfulness-aware analysis framework based on reasoning trajectories, comparing model behavior under full-input versus option-only conditions. Using test-time reasoning strategies, we integrate completeness verification and trajectory inspection to distinguish spurious data shortcuts from legitimate inference.
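The core comparison described above can be sketched as a small evaluation harness. This is an illustrative sketch only, not the paper's implementation: `ask_model` is a hypothetical stand-in for an LLM call, and the prompt format is an assumption.

```python
# Minimal sketch: evaluate an MCQA model under full-input vs. choices-only
# conditions. `ask_model` is a hypothetical callable returning a letter.

def format_prompt(question, choices, choices_only=False):
    """Build an MCQA prompt, optionally hiding the question stem."""
    stem = "" if choices_only else question + "\n"
    opts = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
    return f"{stem}{opts}\nAnswer with the letter only."

def accuracy(items, ask_model, choices_only=False):
    """Fraction of (question, choices, gold_letter) items answered correctly."""
    correct = 0
    for question, choices, gold in items:
        pred = ask_model(format_prompt(question, choices, choices_only))
        correct += (pred.strip().upper() == gold)
    return correct / len(items)

# Toy stand-in model (always answers "B") just to make the sketch runnable.
items = [("2+2=?", ["3", "4", "5"], "B")]
full_acc = accuracy(items, lambda prompt: "B", choices_only=False)
partial_acc = accuracy(items, lambda prompt: "B", choices_only=True)
```

A gap between `full_acc` and `partial_acc` (or the lack of one) is the signal the analysis starts from; the framework then inspects reasoning trajectories to judge whether choices-only success reflects shortcuts or reconstructed inference.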
Contribution/Results: Contrary to conventional assumptions, successful option-only inference is not inherently defective: models often reconstruct the missing question content and perform valid reasoning over it. Experiments show that test-time reasoning improves accuracy on option-only inputs in roughly 50% of cases, with minimal sensitivity to reasoning-trace length. These findings suggest that LLMs possess non-superficial, reasonably robust reasoning capabilities, challenging the prevailing view that partial-input success necessarily indicates flawed reasoning.
📝 Abstract
Large language models (LLMs) now generate reasoning before answering, excelling in tasks like multiple-choice question answering (MCQA). Yet a concern is that LLMs do not solve MCQs as intended: prior work finds LLMs without reasoning succeed in MCQA without using the question, i.e., from the choices only. Such partial-input success is often deemed problematic, but reasoning traces could reveal whether these strategies are truly shallow in choices-only settings. To study these strategies, we have reasoning LLMs solve MCQs given full inputs and choices-only inputs; test-time reasoning often boosts accuracy on full inputs, and boosts choices-only accuracy about half the time. While this could stem from shallow shortcuts, choices-only success is barely affected by the length of reasoning traces, and after finding the traces pass faithfulness tests, we show they use less problematic strategies such as inferring the missing question. In all, we challenge claims that partial-input success is always a flaw, and we discuss how reasoning traces could separate problematic data from less problematic reasoning.