🤖 AI Summary
Current evaluations of large language model agents rely primarily on linear tasks, which fail to expose their limitations in planning and navigation within complex, nonlinear scenarios. To address this gap, this work proposes a new benchmark, The Amazing Agent Race (AAR), which introduces directed acyclic graph (DAG)-structured tasks requiring agents to perform multi-step navigation through Wikipedia, invoke chains of tools, and synthesize results. The benchmark programmatically generates 1,400 task instances, covering both sequential and compositional variants, and incorporates diagnostic metrics across three dimensions: navigation, tool invocation, and arithmetic reasoning. Evaluation reveals that even the best-performing agent achieves only 37.2% accuracy, with navigation errors occurring in 27% to 52% of trials, substantially more often than tool-use errors (below 17%). These findings highlight “strong tools, weak navigation” as a critical bottleneck and indicate that agent architecture influences performance as much as model scale.
📝 Abstract
Existing tool-use benchmarks for LLM agents are overwhelmingly linear: our analysis of six benchmarks shows 55 to 100% of instances are simple chains of 2 to 5 steps. We introduce The Amazing Agent Race (AAR), a benchmark featuring directed acyclic graph (DAG) puzzles (or "legs") with fork-merge tool chains. We release 1,400 instances across two variants: sequential (800 legs) and compositional (600 DAG legs). Agents must navigate Wikipedia, execute multi-step tool chains, and aggregate results into a verifiable answer. Legs are procedurally generated from Wikipedia seeds across four difficulty levels with live-API validation. Three complementary metrics (finish-line accuracy, pit-stop visit rate, and roadblock completion rate) separately diagnose navigation, tool-use, and arithmetic failures. Evaluating three agent frameworks on 1,400 legs, the best achieves only 37.2% accuracy. Navigation errors dominate (27 to 52% of trials) while tool-use errors remain below 17%, and agent architecture matters as much as model scale (Claude Code matches Codex CLI at 37% with 6x fewer tokens). The compositional structure of AAR reveals that agents fail not at calling tools but at navigating to the right pages, a blind spot invisible to linear benchmarks. The project page can be accessed at: https://minnesotanlp.github.io/the-amazing-agent-race
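To make the contrast between a simple chain and a fork-merge leg concrete, the following is a minimal Python sketch, not the benchmark's actual code. The `Step` class, the `"pit_stop"` and `"roadblock"` labels, the step names, and the scoring function are all hypothetical. The abstract does not spell out how each metric maps to each failure type, so the sketch assumes that pit stops are pages to visit and roadblocks are tool calls to complete.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class Step:
    """One node of a leg (hypothetical schema, not the AAR data format)."""
    name: str
    kind: str              # assumed labels: "pit_stop" or "roadblock"
    deps: tuple = ()       # names of steps that must finish first


def is_linear_chain(steps):
    """True if no step has more than one dependency or more than one dependent."""
    dependents = {s.name: 0 for s in steps}
    for s in steps:
        if len(s.deps) > 1:          # a merge point
            return False
        for d in s.deps:
            dependents[d] += 1
    return all(count <= 1 for count in dependents.values())  # no fork point


def execution_order(steps):
    """Return one valid order in which the steps of a DAG leg can be executed."""
    graph = {s.name: set(s.deps) for s in steps}
    return list(TopologicalSorter(graph).static_order())


def score_leg(steps, visited, completed, answer, expected):
    """Compute three illustrative metrics for a single trial on one leg."""
    pit_stops = [s.name for s in steps if s.kind == "pit_stop"]
    roadblocks = [s.name for s in steps if s.kind == "roadblock"]

    def rate(names, done):
        return sum(n in done for n in names) / len(names) if names else 1.0

    return {
        "finish_line_correct": answer == expected,
        "pit_stop_visit_rate": rate(pit_stops, visited),
        "roadblock_completion_rate": rate(roadblocks, completed),
    }


# A hypothetical fork-merge leg: page_b forks into two lookups that merge in combine.
EXAMPLE_LEG = [
    Step("page_a", "pit_stop"),
    Step("page_b", "pit_stop", ("page_a",)),
    Step("lookup_x", "roadblock", ("page_b",)),
    Step("lookup_y", "roadblock", ("page_b",)),
    Step("combine", "roadblock", ("lookup_x", "lookup_y")),
]

# A hypothetical linear chain for comparison.
EXAMPLE_CHAIN = [
    Step("page_a", "pit_stop"),
    Step("lookup_x", "roadblock", ("page_a",)),
    Step("combine", "roadblock", ("lookup_x",)),
]
```

Under these assumptions, a trial that reaches only `page_a` and completes only `lookup_x` would score low on both rates even if every tool call it did make succeeded, which is the kind of separation between navigation and tool-use failures that the three metrics are meant to provide.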