Benchmarks Saturate When The Model Gets Smarter Than The Judge

📅 2026-01-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing benchmarks for large language models often fail because of dataset noise and insufficient judge competence, particularly when model performance approaches or exceeds that of the judge, leading to apparent benchmark saturation. This work addresses these limitations by constructing Omni-MATH-2, a manually revised mathematical evaluation dataset comprising 4,181 cleaned exact-answer problems and 247 tagged non-standard problems, produced through expert review, LaTeX compilability verification, and multi-dimensional problem annotations. The study also quantifies judge errors, showing, for instance, that Omni-Judge is wrong in 96.4% of the cases where the two judges disagree, and identifies these errors as a primary cause of benchmark failure. By treating data quality and judge reliability as joint requirements for benchmark construction, the work demonstrates that improving model capability alone cannot compensate for a low-quality evaluation framework.
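Of the curation steps listed above, the LaTeX compilability check is the one most naturally scripted. The paper describes a manual, expert-driven audit, so the snippet below is only a hedged sketch of how an automated pre-screen might look, assuming a local pdflatex installation; the `compiles` helper and the wrapper template are illustrative, not the authors' pipeline.

```python
# Hypothetical pre-screen for LaTeX compilability; not the authors' actual tooling.
import subprocess
import tempfile
from pathlib import Path

# Minimal wrapper document; assumes problem statements only need amsmath/amssymb.
TEMPLATE = r"""\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
%s
\end{document}
"""

def compiles(problem_latex: str, timeout: int = 30) -> bool:
    """Return True if the problem statement compiles as a standalone document."""
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "problem.tex").write_text(TEMPLATE % problem_latex, encoding="utf-8")
        result = subprocess.run(
            ["pdflatex", "-interaction=nonstopmode", "-halt-on-error", "problem.tex"],
            cwd=tmp,
            capture_output=True,
            timeout=timeout,
        )
        return result.returncode == 0

# Example: flag problems that fail to compile and route them to human reviewers.
# non_compiling = [p for p in problems if not compiles(p["statement"])]
```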

📝 Abstract
Benchmarks are important tools to track progress in the development of Large Language Models (LLMs), yet inaccuracies in datasets and evaluation methods consistently undermine their effectiveness. Here, we present Omni-MATH-2, a manually revised version of the Omni-MATH dataset comprising a clean, exact-answer subset ($n{=}4181$) and a tagged, non-standard subset ($n{=}247$). Each problem was audited to ensure LaTeX compilability, solvability and verifiability, which involved adding missing figures or information, labeling problems requiring a proof, estimation or image, and removing clutter. This process significantly reduces dataset-induced noise, thereby providing a more precise assessment of model performance. The annotated dataset also allows us to evaluate judge-induced noise by comparing GPT-5 mini with the original Omni-Judge, revealing substantial discrepancies between judges on both the clean and tagged problem subsets. Expert annotations reveal that Omni-Judge is wrong in $96.4\%$ of the judge disagreements, indicating its inability to differentiate between models' abilities, even well before saturation of the benchmark occurs. As problems become more challenging, we find that increasingly competent judges become essential in order to prevent judge errors from masking genuine differences between models. Finally, neither judge identifies the present failure modes for the subset of tagged problems, demonstrating that dataset quality and judge reliability are both critical to develop accurate benchmarks of model performance.
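The 96.4% figure quoted in the abstract is, in effect, a tally over expert-adjudicated judge disagreements. Below is a minimal sketch of that computation, assuming a hypothetical record layout (`omni_judge_correct`, `gpt5_mini_correct`, `expert_correct`); the field names and helper are illustrative and not the paper's actual data schema.

```python
# Illustrative computation of judge-disagreement statistics; the schema is hypothetical.
from dataclasses import dataclass
from typing import Optional

@dataclass
class JudgedAnswer:
    omni_judge_correct: bool                # Omni-Judge's verdict on a model answer
    gpt5_mini_correct: bool                 # GPT-5 mini's verdict on the same answer
    expert_correct: Optional[bool] = None   # expert adjudication, gathered for disagreements

def judge_disagreement_stats(records: list[JudgedAnswer]) -> dict[str, float]:
    disagreements = [r for r in records if r.omni_judge_correct != r.gpt5_mini_correct]
    adjudicated = [r for r in disagreements if r.expert_correct is not None]
    if not records or not adjudicated:
        raise ValueError("need records and at least one adjudicated disagreement")
    omni_wrong = sum(r.omni_judge_correct != r.expert_correct for r in adjudicated)
    return {
        "disagreement_rate": len(disagreements) / len(records),
        "omni_judge_error_share": omni_wrong / len(adjudicated),  # ~0.964 reported in the paper
    }

# Example with three answers and one adjudicated disagreement:
records = [
    JudgedAnswer(True, True),
    JudgedAnswer(True, False, expert_correct=False),  # Omni-Judge accepted a wrong answer
    JudgedAnswer(False, False),
]
print(judge_disagreement_stats(records))
# {'disagreement_rate': 0.333..., 'omni_judge_error_share': 1.0}
```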
Problem

Research questions and friction points this paper is trying to address.

benchmark saturation
judge reliability
dataset quality
large language models
evaluation accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

benchmark reliability
judge-induced noise
dataset curation
LLM evaluation
Omni-MATH-2
Marthe Ballon
Data Analytics Lab, Vrije Universiteit Brussel, Pleinlaan 5, 1050 Brussel, Belgium; imec-SMIT, Vrije Universiteit Brussel, Pleinlaan 9, 1050 Brussels, Belgium
A. Algaba
Data Analytics Lab, Vrije Universiteit Brussel, Pleinlaan 5, 1050 Brussel, Belgium; imec-SMIT, Vrije Universiteit Brussel, Pleinlaan 9, 1050 Brussels, Belgium
Brecht Verbeken
Data Analytics Lab, Vrije Universiteit Brussel, Pleinlaan 5, 1050 Brussel, Belgium; imec-SMIT, Vrije Universiteit Brussel, Pleinlaan 9, 1050 Brussels, Belgium
Vincent Ginis
Vrije Universiteit Brussel / Harvard University
Physics | Machine Learning