🤖 AI Summary
Evaluating complex, multi-step agent behaviors, such as tool use and intermediate reasoning, typically depends on expert judgment, which makes it costly and difficult to scale. This work presents the first systematic exploration of using large language models for automated agent evaluation and proposes EvalAgent, an AI assistant that automates the end-to-end evaluation pipeline. EvalAgent encodes domain-specific assessment knowledge as composable “evaluation skills” that combine procedural instructions, reusable code and templates, and dynamically retrieved API documentation. These skills compose into a trace-based pipeline that produces complete evaluation artifacts: metrics, executable code, and reports. The contributions are the evaluation skill mechanism, a meta-evaluation framework, the AgentEvalBench benchmark of 20 agents, and a new metric, Eval@1, which measures whether generated evaluation code both executes and yields meaningful results on the first run. Experiments show that EvalAgent raises Eval@1 from 17.5% to 65% and achieves a 79.5% human expert preference rate over baseline approaches.
📝 Abstract
Agent evaluation requires assessing complex multi-step behaviors involving tool use and intermediate reasoning, making it costly and expertise-intensive. A natural question arises: can frontier coding assistants reliably automate this evaluation process? Our study shows that simply prompting coding assistants is insufficient for this task. Without domain-specific evaluation knowledge, frontier coding assistants achieve only a 30% execution success rate and produce over-engineered evaluations averaging 12+ metrics per agent, indicating that strong coding ability does not automatically translate to reliable agent evaluation. We introduce EvalAgent, an AI assistant that automates the end-to-end agent evaluation pipeline. EvalAgent encodes evaluation domain expertise as evaluation skills (procedural instructions, reusable code and templates, and dynamically retrieved API documentation) that compose into a trace-based pipeline producing complete evaluation artifacts including metrics, executable code, and reports. To systematically assess generated evaluations, we introduce a meta-evaluation framework alongside AgentEvalBench, a benchmark comprising 20 agents, each paired with evaluation requirements and test scenarios. We further propose the Eval@1 metric to measure whether generated evaluation code both executes and yields meaningful results on the first run. Our experiments show that EvalAgent produces focused evaluations, improving Eval@1 from 17.5% to 65%, and achieving 79.5% human expert preference over baseline approaches. Further ablation studies show that evaluation skills are critical for handling complex evaluation: removing them causes Eval@1 to drop significantly from 65% to 30%.
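To make the Eval@1 definition concrete, the following minimal sketch computes the fraction of first-run attempts in which generated evaluation code both executes and yields meaningful results. It is an illustration under stated assumptions, not the authors' implementation: the `FirstRunResult` record, its field names, the `eval_at_1` function, and the sample data are all hypothetical, and how "meaningful" is judged is left as a precomputed boolean because the abstract does not specify the criterion.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class FirstRunResult:
    """Outcome of the first run of generated evaluation code (hypothetical record)."""
    agent_id: str
    executed: bool    # the code ran to completion without error
    meaningful: bool  # the results were judged meaningful (criterion assumed)


def eval_at_1(results: Sequence[FirstRunResult]) -> float:
    """Fraction of first runs that both executed and produced meaningful results."""
    if not results:
        raise ValueError("eval_at_1 needs at least one result")
    passed = sum(1 for r in results if r.executed and r.meaningful)
    return passed / len(results)


# Made-up sample data, not taken from AgentEvalBench.
results = [
    FirstRunResult("agent-a", executed=True, meaningful=True),
    FirstRunResult("agent-b", executed=True, meaningful=False),
    FirstRunResult("agent-c", executed=False, meaningful=False),
    FirstRunResult("agent-d", executed=True, meaningful=True),
]
print(f"Eval@1 = {eval_at_1(results):.1%}")
```

Requiring both conditions is what separates this metric from a plain execution success rate: code that runs but reports empty or degenerate results, as with `agent-b` in the sample, does not count as a pass.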