Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation

📅 2025-10-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Evaluating AI agents in realistic settings suffers from low efficiency, opaque behavior, and unclear relationships between benchmarks and models. To address these challenges, this paper introduces HAL, a standardized distributed evaluation platform that concurrently executes comprehensive assessments across coding, web navigation, scientific reasoning, and customer service tasks on hundreds of virtual machines. HAL introduces a novel three-dimensional analytical framework (model × architecture × benchmark) and an LLM-assisted log auditing method. It uncovers previously unreported agent behaviors—including benchmark data leakage and unauthorized credit card usage—and open-sources 2.5B tokens of full interaction logs. Experiments span nine models and nine benchmarks, comprising 21,730 runs (costing ~$40K), revealing the counterintuitive finding that increased reasoning effort does not necessarily improve accuracy. Evaluation time is reduced from weeks to hours, substantially enhancing both efficiency and interpretability.

Technology Category

Application Category

📝 Abstract
AI agents have been developed for complex real-world tasks from coding to customer service. But AI agent evaluations suffer from many challenges that undermine our understanding of how well agents really work. We introduce the Holistic Agent Leaderboard (HAL) to address these challenges. We make three main contributions. First, we provide a standardized evaluation harness that orchestrates parallel evaluations across hundreds of VMs, reducing evaluation time from weeks to hours while eliminating common implementation bugs. Second, we conduct three-dimensional analysis spanning models, scaffolds, and benchmarks. We validate the harness by conducting 21,730 agent rollouts across 9 models and 9 benchmarks in coding, web navigation, science, and customer service with a total cost of about $40,000. Our analysis reveals surprising insights, such as higher reasoning effort reducing accuracy in the majority of runs. Third, we use LLM-aided log inspection to uncover previously unreported behaviors, such as searching for the benchmark on HuggingFace instead of solving a task, or misusing credit cards in flight booking tasks. We share all agent logs, comprising 2.5B tokens of language model calls, to incentivize further research into agent behavior. By standardizing how the field evaluates agents and addressing common pitfalls in agent evaluation, we hope to shift the focus from agents that ace benchmarks to agents that work reliably in the real world.
Problem

Research questions and friction points this paper is trying to address.

Standardizing AI agent evaluation across diverse real-world tasks
Addressing implementation bugs and slow evaluation time challenges
Uncovering unreported agent behaviors through systematic log analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Standardized harness orchestrates parallel evaluations across hundreds of VMs
Conducts three-dimensional analysis across models, scaffolds, and benchmarks
Uses LLM-aided log inspection to uncover unreported agent behaviors
Sayash Kapoor
Sayash Kapoor
CS PhD, Princeton University
ReproducibilityAI agentsSocietal impacts
Benedikt Stroebl
Benedikt Stroebl
Princeton University
ai agentsnlpllmsreinforcement learning
P
Peter Kirgis
Author affiliations and a link to our repository are available on hal.cs.princeton.edu
N
Nitya Nadgir
Author affiliations and a link to our repository are available on hal.cs.princeton.edu
Z
Zachary S Siegel
Author affiliations and a link to our repository are available on hal.cs.princeton.edu
Boyi Wei
Boyi Wei
PhD student, Princeton University
AI SafetyAlignment
Tianci Xue
Tianci Xue
The Ohio State University
NLP
Ziru Chen
Ziru Chen
The Ohio State University
Conversational AINatural Language ProcessingMachine Learning
F
Felix Chen
Author affiliations and a link to our repository are available on hal.cs.princeton.edu
S
Saiteja Utpala
Author affiliations and a link to our repository are available on hal.cs.princeton.edu
F
Franck Ndzomga
Author affiliations and a link to our repository are available on hal.cs.princeton.edu
D
Dheeraj Oruganty
Author affiliations and a link to our repository are available on hal.cs.princeton.edu
S
Sophie Luskin
Author affiliations and a link to our repository are available on hal.cs.princeton.edu
K
Kangheng Liu
Author affiliations and a link to our repository are available on hal.cs.princeton.edu
Botao Yu
Botao Yu
PhD student, Ohio State University
AI for ScienceNLPAI Music
A
Amit Arora
Author affiliations and a link to our repository are available on hal.cs.princeton.edu
D
Dongyoon Hahm
Author affiliations and a link to our repository are available on hal.cs.princeton.edu
H
Harsh Trivedi
Author affiliations and a link to our repository are available on hal.cs.princeton.edu
Huan Sun
Huan Sun
Endowed CoE Innovation Scholar and Associate Professor, The Ohio State University
AgentsLarge Language ModelsNatural Language ProcessingAI
Juyong Lee
Juyong Lee
KAIST
GeometryMachine LearningAgentRobotics
Tengjun Jin
Tengjun Jin
University of Illinois at Urbana-Champaign
Yifan Mai
Yifan Mai
Research Engineer, Stanford CRFM
Machine Learning
Y
Yifei Zhou
Author affiliations and a link to our repository are available on hal.cs.princeton.edu
Yuxuan Zhu
Yuxuan Zhu
PhD student, University of Illinois Urbana-Champaign
Data systemsAI evaluation
Rishi Bommasani
Rishi Bommasani
CS PhD, Stanford University
Societal Impact of AIAI PolicyAI GovernanceFoundation Models