Establishing Best Practices for Building Rigorous Agentic Benchmarks

πŸ“… 2025-07-03
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Current AI agent benchmarks often suffer from ill-posed task formulations and biased reward mechanisms, producing relative performance estimation errors of up to 100% and undermining the validity and reliability of their results. To address this, the authors propose the Agentic Benchmark Checklist (ABC), a standardized, empirically grounded set of guidelines that consolidates benchmark-building experience, surveyed best practices, and previously reported issues. Through case studies and defect diagnosis, ABC surfaces critical design flaws in existing benchmarks; applied to CVE-Bench, whose evaluation design is particularly complex, ABC-guided revisions reduce performance overestimation by 33%, improving evaluation rigor and cross-benchmark comparability. The work offers a reusable methodological foundation and a practical standard for rigorous, trustworthy AI agent evaluation.

πŸ“ Abstract
Benchmarks are essential for quantitatively tracking progress in AI. As AI agents become increasingly capable, researchers and practitioners have introduced agentic benchmarks to evaluate agents on complex, real-world tasks. These benchmarks typically measure agent capabilities by evaluating task outcomes via specific reward designs. However, we show that many agentic benchmarks have issues in task setup or reward design. For example, SWE-bench Verified uses insufficient test cases, while TAU-bench counts empty responses as successful. Such issues can lead to under- or overestimation of agents' performance by up to 100% in relative terms. To make agentic evaluation rigorous, we introduce the Agentic Benchmark Checklist (ABC), a set of guidelines that we synthesized from our benchmark-building experience, a survey of best practices, and previously reported issues. When applied to CVE-Bench, a benchmark with a particularly complex evaluation design, ABC reduces the performance overestimation by 33%.
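
To make the failure mode concrete, the sketch below is a minimal, hypothetical illustration (not code from TAU-bench, SWE-bench Verified, or the paper; the task format and function names are invented) of how an outcome-based grader that accepts empty responses inflates measured success, and of a stricter check in the spirit of ABC's guidelines.

```python
# Hypothetical sketch of a biased reward check and a stricter fix.
# Nothing here is actual benchmark code; the task format is invented.

def naive_reward(agent_response: str, expected_refusal: bool) -> bool:
    """Buggy grader: when the reference answer is a refusal, an empty
    response trivially counts as success, so a do-nothing agent passes."""
    if expected_refusal:
        return "cannot" in agent_response or agent_response == ""
    return agent_response != ""

def stricter_reward(agent_response: str, expected_refusal: bool) -> bool:
    """Stricter grader: require a non-empty, explicit refusal rather
    than treating silence as agreement with the reference answer."""
    text = agent_response.strip().lower()
    if expected_refusal:
        return bool(text) and any(
            kw in text for kw in ("cannot", "unable", "not possible")
        )
    return bool(text)

# A do-nothing agent scores 100% on refusal tasks under the naive
# grader and 0% under the stricter one: the gap is pure overestimation.
assert naive_reward("", expected_refusal=True)
assert not stricter_reward("", expected_refusal=True)
```

Keyword matching is itself fragile; the point of the sketch is only that degenerate submissions must never score as passes.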
Problem

Research questions and friction points this paper is trying to address.

Identifies flaws in agentic benchmark task setups
Addresses reward design issues causing performance misestimation
Proposes guidelines for rigorous agent evaluation benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Agentic Benchmark Checklist (ABC), sketched below
Reduces performance overestimation by 33%
Synthesizes guidelines from benchmark-building experience
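
ABC is a written checklist rather than software, but one way to picture operationalizing checklist items (purely a hypothetical sketch; the probe set and grader interface below are not from the paper) is to probe a benchmark's grader with degenerate submissions that a rigorous reward design should never accept.

```python
# Hypothetical sketch of checklist-style probing of a benchmark grader.
# Neither the probes nor the grader interface come from the paper.
from typing import Callable

Grader = Callable[[str], bool]  # maps an agent submission to pass/fail

# Degenerate submissions that a rigorous grader should never accept.
DEGENERATE_PROBES = {
    "empty response": "",
    "whitespace only": "   \n",
    "generic apology": "Sorry, I can't help with that.",
}

def audit_grader(grader: Grader) -> list[str]:
    """Return the checklist items the grader violates."""
    violations = []
    for name, submission in DEGENERATE_PROBES.items():
        if grader(submission):
            violations.append(f"grader accepts {name}")
    return violations

# Example: a grader that only checks the response is short enough
# would be flagged on all three probes.
lax_grader: Grader = lambda s: len(s) < 100
print(audit_grader(lax_grader))
```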
Yuxuan Zhu
PhD student, University of Illinois Urbana-Champaign
Data systems, AI evaluation
Tengjun Jin
University of Illinois at Urbana-Champaign
Yada Pruksachatkun
New York University
Andy Zhang
Stanford University
Shu Liu
University of California, Berkeley
Sasha Cui
Yale University
Sayash Kapoor
CS PhD, Princeton University
Reproducibility, AI agents, Societal impacts
Shayne Longpre
MIT, Stanford, Apple
Deep Learning, Natural Language Understanding
Kevin Meng
Transluce
Rebecca Weiss
MLCommons
Fazl Barez
University of Oxford
AI Safety, Explainability, Interpretability, AI Governance and Policy
Rahul Gupta
Amazon
Jwala Dhamala
Amazon AGI
Large Language Models, Natural Language Processing, Responsible AI
Jacob Merizian
UK AISI
Mario Giulianelli
Associate Professor, UCL
Computational Linguistics, Language Modelling, AI Evaluation
Harry Coppock
Imperial College London
Deep Learning, Signal Processing, Audio, Representation Learning, Quantisation
Cozmin Ududec
UK AI Security Institute
Quantum Mechanics, Machine Learning, LLM capabilities
Jasjeet Sekhon
Eugene Meyer Professor of Data Science, Political Science, and Statistics, Yale University
Causal Inference, Machine Learning, Statistics, Social Science
Jacob Steinhardt
Stanford University
Machine learning, Statistics
Antony Kellerman
UIUC
Sarah Schwettmann
MIT
Cognitive science, Machine Learning, Computer Vision, Artificial Intelligence, Computational Neuroscience
Matei Zaharia
UC Berkeley and Databricks
Distributed Systems, Machine Learning, Databases, Security
Ion Stoica
Professor of Computer Science, UC Berkeley
Cloud Computing, Networking, Distributed Systems, Big Data
Percy Liang
Associate Professor of Computer Science, Stanford University
machine learning, natural language processing
Daniel Kang
UIUC
Computer Science