BenchMarker: An Education-Inspired Toolkit for Highlighting Flaws in Multiple-Choice Benchmarks

📅 2026-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multiple-choice question answering benchmarks often yield unreliable model evaluations due to issues such as data contamination, answer-option shortcuts, and structural or grammatical errors. This work proposes BenchMarker—the first automated auditing toolkit grounded in educational assessment theory—that systematically incorporates 19 item-quality rules and leverages large language models as evaluators, complemented by human validation, to detect three prevalent defect categories. Audits across 12 mainstream benchmarks reveal that contaminated items artificially inflate model accuracy, while flawed item construction significantly degrades performance and alters leaderboard rankings. Moreover, existing repair strategies, though partially effective, may inadvertently introduce new defects. By formally integrating psychometric item-quality standards from educational testing into NLP benchmark evaluation, this study establishes a novel paradigm for reliable and robust model assessment.

📝 Abstract
Multiple-choice question answering (MCQA) is standard in NLP, but benchmarks lack rigorous quality control. We present BenchMarker, an education-inspired toolkit that uses LLM judges to flag three common MCQ flaws: 1) contamination: items that appear verbatim online; 2) shortcuts: cues in the answer choices that enable guessing; and 3) writing errors: structural or grammatical issues based on a 19-rule education rubric. We validate BenchMarker with human annotations, then run the tool to audit 12 benchmarks, revealing that: 1) contaminated MCQs tend to inflate accuracy, while writing errors tend to lower it and change rankings beyond random; and 2) prior benchmark repairs address their targeted issues (i.e., lowering accuracy with LLM-written distractors) but inadvertently add new flaws (i.e., implausible distractors, many correct answers). Overall, flaws in MCQs degrade NLP evaluation, but education research offers a path forward. We release BenchMarker to bridge the fields and improve MCQA benchmark design.
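This page does not describe BenchMarker's actual interface, so the sketch below is only a hypothetical illustration of the LLM-as-judge auditing pattern the abstract describes: each MCQ item is checked against rubric-style yes/no prompts for answer-choice shortcuts and writing errors, while contamination is typically handled separately by searching for the item verbatim online. All names, prompts, and the `judge` callable are assumptions, not the released toolkit.

```python
# Hypothetical sketch of an LLM-judge MCQ audit; not the released BenchMarker API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class MCQItem:
    question: str
    choices: list[str]

# Rubric-style yes/no prompts for two of the flaw categories named in the abstract;
# contamination is usually checked by verbatim search rather than an LLM judge.
FLAW_PROMPTS = {
    "shortcut": (
        "Looking only at the answer choices below (no question is given), can the "
        "correct answer be guessed from cues such as length, grammar, or implausible "
        "distractors? Answer YES or NO.\nChoices:\n{choices}"
    ),
    "writing_error": (
        "Does this multiple-choice item violate standard item-writing guidelines "
        "(e.g., unclear stem, ungrammatical options, 'all of the above')? "
        "Answer YES or NO.\nQuestion: {question}\nChoices:\n{choices}"
    ),
}

def audit_item(item: MCQItem, judge: Callable[[str], str]) -> dict[str, bool]:
    """Flag one MCQ item for each flaw category using an LLM judge callable."""
    choices_text = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(item.choices))
    flags = {}
    for flaw, template in FLAW_PROMPTS.items():
        prompt = template.format(question=item.question, choices=choices_text)
        # The judge is any text-in/text-out LLM call supplied by the caller.
        flags[flaw] = judge(prompt).strip().upper().startswith("YES")
    return flags
```

Under these assumptions, a call such as `audit_item(item, judge=my_llm_client)` would return per-flaw flags like `{"shortcut": False, "writing_error": True}`; aggregating such flags over a whole benchmark yields the kind of audit statistics the paper reports across 12 benchmarks.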
Problem

Research questions and friction points this paper is trying to address.

multiple-choice question answering
benchmark quality
evaluation flaws
NLP evaluation
question design
Innovation

Methods, ideas, or system contributions that make the work stand out.

BenchMarker
multiple-choice question answering
benchmark auditing
LLM judges
education-inspired evaluation
Authors

Nishant Balepur (University of Maryland; New York University)
Bhavya Rajasekaran (University of Maryland)
Jane Oh (University of Maryland)
Michael Xie (University of Maryland)
Atrey Desai (University of Maryland)
Vipul Gupta (Scale AI)
Steven James Moore (Carnegie Mellon University)
Eunsol Choi (New York University)
Rachel Rudinger (University of Maryland)
Jordan Lee Boyd-Graber (University of Maryland)