Garbage In, Reasoning Out? Why Benchmark Scores are Unreliable and What to Do About It

📅 2025-06-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work systematically audits three major commonsense reasoning benchmarks—SocialIQa, FauxPas-EAI, and ToMi—and exposes critical flaws in item design and evaluation methodology: current automatic scoring overemphasizes superficial output formatting, making evaluations vulnerable to spurious format-based cues and unable to reliably assess LLMs’ genuine reasoning capabilities. Method: We propose a new evaluation paradigm centered on “reasoning process consistency,” prioritizing logically robust, information-grounded inference over surface-level correctness. To support this, we release a human-verified, re-annotated clean dataset and a diagnostic toolkit. Contribution/Results: Through multi-round diagnostic evaluations across GPT-3/3.5/4/o1 and LLaMA 3.1, we demonstrate that apparent performance gains largely stem from input perturbations rather than substantive reasoning improvements. Our framework establishes a foundation for interpretable, reproducible, process-oriented reasoning evaluation—advancing both theoretical understanding and practical assessment rigor.

📝 Abstract
We conduct a systematic audit of three widely used reasoning benchmarks, SocialIQa, FauxPas-EAI, and ToMi, and uncover pervasive flaws in both benchmark items and evaluation methodology. Using five LLMs (GPT-{3, 3.5, 4, o1}, and LLaMA 3.1) as diagnostic tools, we identify structural, semantic, and pragmatic issues in benchmark design (e.g., duplicated items, ambiguous wording, and implausible answers), as well as scoring procedures that prioritize output form over reasoning process. Through systematic human annotation and re-evaluation on cleaned benchmark subsets, we find that model scores often improve due to surface wording variations rather than improved reasoning. In fact, further analyses show that model performance is highly sensitive to minor input variations such as context availability and phrasing, revealing that high scores may reflect alignment with format-specific cues rather than consistent inference based on the input. These findings challenge the validity of current benchmark-based claims about reasoning in LLMs, and highlight the need for evaluation protocols that assess reasoning as a process of drawing inferences from available information, rather than as static output selection. We release audited data and evaluation tools to support more interpretable and diagnostic assessments of model reasoning.
Problem

Research questions and friction points this paper is trying to address.

Audit reveals flaws in reasoning benchmarks' design and evaluation
Model scores improve due to wording, not reasoning capability
Benchmark claims on LLM reasoning lack validity and reliability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic audit of flaws in reasoning benchmarks
LLMs as diagnostic tools for benchmark issues
Human annotation for cleaned benchmark evaluation
Seyed Mahed Mousavi
Signals and Interactive Systems Lab, University of Trento, Italy
Edoardo Cecchinato
Signals and Interactive Systems Lab, University of Trento, Italy
Lucia Hornikova
Masaryk University, Czech Republic
Giuseppe Riccardi
Professor of Computer Science, University of Trento, Italy
Natural Language Processing · Speech Processing · Dialogue · Machine Learning