AI Summary
Current AI benchmarks often suffer from implicit assumptions, ambiguous environment descriptions, and fragile scoring logic, which undermines their reliability for evaluating model capabilities. This work proposes the Auto Benchmark Audit (ABA) framework, the first large-scale automated auditing system for state-of-the-art LLM and agent benchmarks. ABA integrates task execution, environment simulation, and logical verification to systematically uncover hidden dependencies, missing specifications, and scoring flaws, with findings validated through expert review and third-party reports. Applied across 168 benchmarks spanning nine domains, ABA identified critical issues in over 25.7% of the evaluated tasks. Excluding the flawed tasks shifted model rankings and raised average performance by 9.9% on SWE-bench Verified and 9.6% on Terminal-Bench 2, showing how strongly such tasks distort capability assessments.
Abstract
Modern AI benchmarks operate at a complexity that outpaces traditional verification methods. Tasks authored by domain experts often contain implicit assumptions, incomplete environment specifications, and brittle evaluation logic that human annotation cannot reliably catch. We introduce Auto Benchmark Audit (ABA), an agentic framework that systematically audits individual benchmark tasks, uncovering issues such as hidden environment dependencies, specification gaps, and limited grading logic. We run ABA on a collection of frontier LLM benchmarks and benchmarks from previous NeurIPS publications, totaling 168 benchmarks across nine domains. Across this corpus, ABA identifies critical issues, including ambiguous task design, execution environment conflicts, and incorrect ground truths, in over 25.7% of the evaluated tasks. The precision of these automated audits is validated by expert review and independent third-party reports such as upstream PRs. Crucially, we demonstrate that these problematic tasks severely distort capability assessments for agents and LLMs: filtering out the flagged tasks shifts model rankings and increases average performance on SWE-bench Verified and Terminal-Bench 2 by 9.9% and 9.6%, respectively. We release the agentic tool and all task annotations to support the future development of frontier benchmarks.
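To make the filtering step concrete, the following is a minimal sketch of how excluding flagged tasks can change both pass rates and model rankings. It is not ABA's implementation: the function names (`pass_rate`, `ranking`), the task IDs, the model names, and all results are hypothetical toy values invented for illustration, and a plain pass rate is assumed as the metric.

```python
from typing import AbstractSet, Dict, List


def pass_rate(results: Dict[str, bool], excluded: AbstractSet[str] = frozenset()) -> float:
    """Fraction of tasks passed, ignoring any task IDs in `excluded`."""
    kept = [passed for task_id, passed in results.items() if task_id not in excluded]
    if not kept:
        raise ValueError("no tasks left after filtering")
    return sum(kept) / len(kept)


def ranking(scores: Dict[str, float]) -> List[str]:
    """Model names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


# Toy data, invented for illustration only (not taken from the paper).
results = {
    "model_a": {"t1": True, "t2": False, "t3": False, "t4": True, "t5": False},
    "model_b": {"t1": True, "t2": True, "t3": False, "t4": False, "t5": True},
}
# Hypothetical task IDs that an audit marked as defective,
# for example because of an incorrect ground truth.
flagged = {"t2", "t5"}

raw = {model: pass_rate(r) for model, r in results.items()}
filtered = {model: pass_rate(r, flagged) for model, r in results.items()}

print("raw:", raw, "ranking:", ranking(raw))
print("filtered:", filtered, "ranking:", ranking(filtered))
```

In this toy setup the two models swap places once the flagged tasks are dropped, which is the kind of ranking shift the abstract describes; the size of the effect here reflects only the made-up data.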