The Amazing Agent Race: Strong Tool Users, Weak Navigators

📅 2026-04-11

🤖 AI Summary
Current evaluations of large language model agents rely primarily on linear tasks, which fail to expose limitations in planning and navigation in complex, nonlinear scenarios. To address this gap, this work proposes a new benchmark, The Amazing Agent Race (AAR), built on directed acyclic graph (DAG)-structured tasks that require agents to navigate Wikipedia over multiple steps, invoke chains of tools, and aggregate the results. The benchmark programmatically generates 1,400 task instances in sequential and compositional variants and reports diagnostic metrics along three dimensions: navigation, tool invocation, and arithmetic reasoning. Even the best-performing agent achieves only 37.2% accuracy, with navigation errors occurring in 27% to 52% of trials, well above tool-use errors (below 17%). These findings identify “strong tools, weak navigation” as a critical bottleneck and indicate that agent architecture matters as much as model scale.
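
To make the linear-versus-DAG distinction concrete, here is a minimal sketch of how a task's step graph could be classified as a simple chain or not. It is not the paper's analysis code: the function name `is_linear_chain`, the edge-list representation, and the step names are all assumptions made for illustration.

```python
def is_linear_chain(edges):
    """Return True if the directed edges form one simple chain a -> b -> c.

    `edges` is a list of (source_step, target_step) pairs. This
    representation is a hypothetical one chosen for the sketch.
    """
    successor = {}       # step -> its single next step
    has_predecessor = set()
    nodes = set()
    for src, dst in edges:
        nodes.update((src, dst))
        # A second outgoing edge is a fork; a second incoming edge is a merge.
        if src in successor or dst in has_predecessor:
            return False
        successor[src] = dst
        has_predecessor.add(dst)

    starts = nodes - has_predecessor
    if len(starts) != 1:
        # No steps at all counts as trivially linear; a cycle or several
        # disconnected chains does not.
        return not nodes

    # Walk from the unique start and check that every step is reached.
    node = next(iter(starts))
    visited = 1
    while node in successor:
        node = successor[node]
        visited += 1
    return visited == len(nodes)


# A three-step chain versus a fork-merge leg (step names are illustrative).
chain = [("search", "fetch"), ("fetch", "extract")]
fork_merge = [
    ("visit_page", "get_population"),
    ("visit_page", "get_area"),
    ("get_population", "density"),
    ("get_area", "density"),
]
print(is_linear_chain(chain))       # True
print(is_linear_chain(fork_merge))  # False
```

Under this sketch, a benchmark counts as "linear" to the extent that its instances pass the check; the fork-merge example fails it because one step feeds two branches that later rejoin.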

📝 Abstract
Existing tool-use benchmarks for LLM agents are overwhelmingly linear: our analysis of six benchmarks shows 55 to 100% of instances are simple chains of 2 to 5 steps. We introduce The Amazing Agent Race (AAR), a benchmark featuring directed acyclic graph (DAG) puzzles (or "legs") with fork-merge tool chains. We release 1,400 instances across two variants: sequential (800 legs) and compositional (600 DAG legs). Agents must navigate Wikipedia, execute multi-step tool chains, and aggregate results into a verifiable answer. Legs are procedurally generated from Wikipedia seeds across four difficulty levels with live-API validation. Three complementary metrics (finish-line accuracy, pit-stop visit rate, and roadblock completion rate) separately diagnose navigation, tool-use, and arithmetic failures. Evaluating three agent frameworks on 1,400 legs, the best achieves only 37.2% accuracy. Navigation errors dominate (27 to 52% of trials) while tool-use errors remain below 17%, and agent architecture matters as much as model scale (Claude Code matches Codex CLI at 37% with 6x fewer tokens). The compositional structure of AAR reveals that agents fail not at calling tools but at navigating to the right pages, a blind spot invisible to linear benchmarks. The project page can be accessed at: https://minnesotanlp.github.io/the-amazing-agent-race
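
As an illustration of what a fork-merge tool chain involves, the sketch below runs a small DAG of steps in dependency order with Python's standard `graphlib` module. It is an assumption-laden toy, not the benchmark's harness: the `run_leg` helper, the step names, and the numbers are hypothetical stand-ins, and no real Wikipedia page or API is called.

```python
from graphlib import TopologicalSorter


def run_leg(steps, deps):
    """Execute every step once all of its predecessors have finished.

    steps: step name -> callable taking a dict of predecessor results
    deps:  step name -> set of predecessor step names
    (Both structures are hypothetical, chosen for this sketch.)
    """
    results = {}
    # static_order() yields each step only after all its predecessors.
    for name in TopologicalSorter(deps).static_order():
        inputs = {dep: results[dep] for dep in deps.get(name, ())}
        results[name] = steps[name](inputs)
    return results


# One page visit forks into two extraction steps, which merge into a
# final arithmetic step. The values are made up, not real data.
steps = {
    "visit_page": lambda _: {"population": 1000, "area_km2": 50},
    "get_population": lambda i: i["visit_page"]["population"],
    "get_area": lambda i: i["visit_page"]["area_km2"],
    "density": lambda i: i["get_population"] / i["get_area"],
}
deps = {
    "visit_page": set(),
    "get_population": {"visit_page"},
    "get_area": {"visit_page"},
    "density": {"get_population", "get_area"},
}

print(run_leg(steps, deps)["density"])  # 20.0
```

The toy also shows why a navigation mistake is costly in this structure: if `visit_page` lands on the wrong page, both branches and the merged answer are wrong even when every later tool call succeeds.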
Problem

Research questions and friction points this paper is trying to address.

tool-use benchmark
navigation errors
directed acyclic graph
LLM agents
non-linear tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

tool-use benchmark
directed acyclic graph (DAG)
agent navigation
procedural task generation
multi-step reasoning