The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

📅 2025-10-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing language agent benchmarks are largely confined to narrow, simplified tasks, lacking diversity, real-world environmental fidelity, and long-horizon interactive capabilities—thus failing to reflect practical deployment performance. Method: We introduce Toolathlon, the first benchmark for language agents designed explicitly for realistic, multi-application scenarios. It encompasses 32 authentic software platforms (e.g., Google Calendar, Notion, Kubernetes, BigQuery) and 604 executable tools, enabling 108 cross-application, long-horizon tasks. Toolathlon leverages the Model Context Protocol (MCP) for high-fidelity tool interfacing, employs real-world state initialization, and implements verifiable, automated evaluation. Contribution/Results: Empirical evaluation reveals a substantial capability gap: the strongest closed-source model, Claude-4.5-Sonnet, achieves only 38.6% task success rate; the top open-source model, DeepSeek-V3.2-Exp, attains just 20.1%. These results underscore the significant challenges language agents face in real-world, production-grade settings.

📝 Abstract
Real-world language agents must handle complex, multi-step workflows across diverse Apps. For instance, an agent may manage emails by coordinating with calendars and file systems, or monitor a production database to detect anomalies and generate reports following an operating manual. However, existing language agent benchmarks often focus on narrow domains or simplified tasks that lack the diversity, realism, and long-horizon complexity required to evaluate agents' real-world performance. To address this gap, we introduce the Tool Decathlon (dubbed Toolathlon), a benchmark for language agents offering diverse Apps and tools, realistic environment setup, and reliable execution-based evaluation. Toolathlon spans 32 software applications and 604 tools, ranging from everyday platforms such as Google Calendar and Notion to professional ones like WooCommerce, Kubernetes, and BigQuery. Most of the tools are based on a high-quality set of Model Context Protocol (MCP) servers that we revised or implemented ourselves. Unlike prior works, which primarily ensure functional realism but offer limited environment state diversity, we provide realistic initial environment states from real software, such as Canvas courses with dozens of students or real financial spreadsheets. The benchmark includes 108 manually sourced or crafted tasks in total, each requiring interaction with multiple Apps over roughly 20 turns on average to complete. Each task is strictly verifiable through dedicated evaluation scripts. Comprehensive evaluation of SOTA models highlights their significant shortcomings: the best-performing model, Claude-4.5-Sonnet, achieves only a 38.6% success rate with 20.2 tool-calling turns on average, while the top open-weights model, DeepSeek-V3.2-Exp, reaches 20.1%. We expect Toolathlon to drive the development of more capable language agents for real-world, long-horizon task execution.
Problem

Research questions and friction points this paper is trying to address.

Benchmarks lack diversity, realism, and long-horizon task complexity
Evaluating language agents for real-world multi-step workflows across Apps
Addressing gaps in agent performance on realistic software environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Tool Decathlon benchmark for diverse realistic agent evaluation
Uses Model Context Protocol servers for 604 tools across 32 applications
Provides realistic initial environment states from real software systems
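The paper's execution-based evaluation means each task ships with a dedicated script that inspects the final environment state rather than grading the agent's text. A minimal sketch of what such a check might look like, assuming a hypothetical calendar task (the task, state schema, and function names below are illustrative, not taken from the benchmark):

```python
# Hypothetical sketch of execution-based task verification.
# After the agent finishes, a per-task script inspects the final
# environment state and returns a binary success signal.

def verify_calendar_task(final_state: dict) -> bool:
    """Pass only if the requested event was created AND the stale one removed."""
    events = {e["title"]: e for e in final_state.get("events", [])}
    created = "Team Sync" in events and events["Team Sync"]["date"] == "2025-11-03"
    removed = "Old Standup" not in events
    return created and removed

# Example final state an agent might have produced:
state = {"events": [{"title": "Team Sync", "date": "2025-11-03"}]}
print(verify_calendar_task(state))  # True
```

Because the check runs against the real resulting state, partial or hallucinated completions fail even if the agent's transcript claims success.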
👥 Authors
- Junlong Li · The Hong Kong University of Science and Technology
- Wenshuo Zhao · The Hong Kong University of Science and Technology
- Jian Zhao · The Hong Kong University of Science and Technology
- Weihao Zeng · The Hong Kong University of Science and Technology · LLM Reasoning, Alignment
- Haoze Wu · The Hong Kong University of Science and Technology
- Xiaochen Wang · The Hong Kong University of Science and Technology
- Rui Ge · Shanghai Jiao Tong University
- Yuxuan Cao · The Hong Kong University of Science and Technology · Data Mining, LLM, LLM Reasoning
- Yuzhen Huang · The Hong Kong University of Science and Technology
- Wei Liu · The Hong Kong University of Science and Technology
- Junteng Liu · The Hong Kong University of Science and Technology · Machine Learning, Natural Language Processing
- Zhaochen Su · The Hong Kong University of Science and Technology · AI/LLM/LVLM, Agent, Reasoning
- Yiyang Guo · The Hong Kong University of Science and Technology
- Fan Zhou · The Hong Kong University of Science and Technology
- Lueyang Zhang · The Hong Kong University of Science and Technology
- Juan Michelini · All Hands AI
- Xingyao Wang · All Hands AI, University of Illinois Urbana-Champaign
- Xiang Yue · Carnegie Mellon University · Natural Language Processing, Large Language Models, Machine Learning
- Shuyan Zhou · Duke University · Large Language Models, AI Agent
- Graham Neubig · Carnegie Mellon University, All Hands AI · Natural Language Processing, Machine Learning, Artificial Intelligence
- Junxian He · The Hong Kong University of Science and Technology · Machine Learning, Natural Language Processing