Fantastic Bugs and Where to Find Them in AI Benchmarks

📅 2025-11-20

🤖 AI Summary

Problem: Invalid items are widespread in AI benchmarks and undermine evaluation reliability, yet identifying them manually is prohibitively costly and does not scale. Method: We propose an automated detection framework grounded in a unidimensional latent variable assumption, jointly modeling item difficulty and model response patterns. Statistical anomaly detection flags items whose response patterns deviate from what that assumption predicts, and an LLM judge performs a first-pass review of the questions before they reach human experts. The flagging step itself needs no human annotation, so it scales to benchmarks with thousands of items. Contribution/Results: Evaluated across nine widely used benchmarks, including MMLU, BBH, and GSM8K, the method guides expert review to identify problematic questions with up to 84% precision. It substantially reduces expert review burden, brings psychometric principles to AI benchmark diagnostics, and enables large-scale, response-pattern-driven discovery and localization of invalid items.

📝 Abstract

Benchmarks are pivotal in driving AI progress, and invalid benchmark questions frequently undermine their reliability. Manually identifying and correcting errors among thousands of benchmark questions is not only infeasible but also a critical bottleneck for reliable evaluation. In this work, we introduce a framework for systematic benchmark revision that leverages statistical analysis of response patterns to flag potentially invalid questions for further expert review. Our approach builds on a core assumption commonly used in AI evaluations that the mean score sufficiently summarizes model performance. This implies a unidimensional latent construct underlying the measurement experiment, yielding expected ranges for various statistics for each item. When empirically estimated values for these statistics fall outside the expected range for an item, the item is more likely to be problematic. Across nine widely used benchmarks, our method guides expert review to identify problematic questions with up to 84% precision. In addition, we introduce an LLM-judge first pass to review questions, further reducing human effort. Together, these components provide an efficient and scalable framework for systematic benchmark revision.
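
To make the flagging idea concrete, here is a minimal sketch in Python. It assumes a binary response matrix (models by items) and uses the item-rest correlation as the per-item statistic, with "not negative" as the expected range under a unidimensional construct. The abstract does not name the statistics or thresholds the authors use, so the choice of statistic, the names `pearson` and `flag_items`, the threshold of 0.0, and the toy data are all illustrative assumptions rather than the paper's method.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def flag_items(responses, threshold=0.0):
    """Flag items whose item-rest correlation falls below `threshold`.

    responses[m][i] is 1 if model m answered item i correctly, else 0.
    The "rest" score of a model is its total score excluding item i.
    """
    n_items = len(responses[0])
    flagged = []
    for i in range(n_items):
        item = [row[i] for row in responses]
        rest = [sum(row) - row[i] for row in responses]
        if pearson(item, rest) < threshold:
            flagged.append(i)
    return flagged


# Synthetic toy data (assumption): 6 models, weakest to strongest, on 5 items.
# Items 0-3 behave as expected; item 4 is solved only by the weaker models.
responses = [
    [0, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0],
]
flagged = flag_items(responses)
print(flagged)  # → [4]
```

An item that stronger models fail more often than weaker ones contradicts the assumption that a single latent ability drives all responses, which is why it becomes a candidate for expert review. A flag is only a prompt for inspection, not proof that the question is invalid.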

Problem

Research questions and friction points this paper is trying to address.

Identifying invalid benchmark questions that undermine AI evaluation reliability

Automating error detection in benchmarks using statistical response pattern analysis

Reducing human effort in benchmark validation through systematic revision framework

Innovation

Methods, ideas, or system contributions that make the work stand out.

Statistical analysis of response patterns flags invalid questions

LLM-judge first pass reduces human review effort

Expected versus empirical statistic ranges identify problematic items
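
The precision figure quoted in the abstract can be read as the share of reviewed, flagged questions that experts confirm as problematic. The sketch below illustrates that calculation under stated assumptions: items are ranked by a per-item statistic (lower meaning more anomalous), experts review the top `k`, and their verdicts are stored in a dictionary. The function name `precision_at_k`, the scores, and the verdicts are hypothetical and synthetic; the paper's actual review protocol is not described in this summary.

```python
def precision_at_k(scores, is_invalid, k):
    """Fraction of the k most anomalous items that experts judged invalid.

    scores:     dict mapping item id -> statistic (lower = more anomalous)
    is_invalid: dict mapping item id -> expert verdict (True if invalid)
    """
    ranked = sorted(scores, key=scores.get)  # ascending: most anomalous first
    top_k = ranked[:k]
    if not top_k:
        return 0.0
    return sum(is_invalid[item] for item in top_k) / len(top_k)


# Synthetic statistic values and hypothetical expert verdicts (assumptions).
scores = {"q1": -0.41, "q2": 0.35, "q3": -0.12, "q4": 0.58, "q5": -0.27, "q6": 0.02}
is_invalid = {"q1": True, "q2": False, "q3": False, "q4": False, "q5": True, "q6": True}

# Experts review the 3 most anomalous items: q1, q5, q3 -> 2 of 3 are invalid.
print(precision_at_k(scores, is_invalid, 3))
```

Higher precision means less expert time spent on questions that turn out to be fine, which is the sense in which the framework reduces human effort.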
