AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts

πŸ“… 2026-01-16
πŸ“ˆ Citations: 3
✨ Influential: 0
πŸ€– AI Summary
This work addresses the limitations of existing agent benchmarks, which often focus on isolated capabilities and struggle to evaluate long-horizon, high-complexity real-world tasks due to reliance on manual feedback that hinders scalability. The authors propose the first comprehensive, automated benchmark tailored to everyday AI usage scenarios, encompassing 32 real-world settings and 138 tasksβ€”each requiring an average of 90 tool invocations and processing over one million tokens. The framework employs user-simulation agents for iterative feedback, Docker-based sandboxing for visual and functional rule validation, and a standardized task interface enabling unified closed-loop evaluation of both open- and closed-source models. Experimental results demonstrate a significant performance gap favoring closed-source models (48.4% vs. 32.1%) and highlight the critical role of co-optimizing models with agent frameworks to enhance overall effectiveness.
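The closed-loop evaluation described above can be sketched as a simple feedback loop: run the agent, score its deliverable against per-task rubrics, and feed simulated user critique back in until every rubric passes or the round budget runs out. This is a minimal illustrative sketch; the names (`Rubric`, `Task`, `closed_loop_eval`) and the loop structure are assumptions, not AgencyBench's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Rubric:
    """A single pass/fail criterion for a task deliverable (hypothetical)."""
    description: str
    check: callable  # deliverable -> bool

@dataclass
class Task:
    query: str
    rubrics: list = field(default_factory=list)

def closed_loop_eval(agent_step, user_feedback, task, max_rounds=5):
    """Run the agent, score its deliverable against the task rubrics, and
    feed simulated-user feedback back in until all rubrics pass or the
    round budget is exhausted. Returns the fraction of rubrics passed."""
    message = task.query
    score = 0.0
    for _ in range(max_rounds):
        deliverable = agent_step(message)                  # one agent rollout
        passed = [r.check(deliverable) for r in task.rubrics]
        score = sum(passed) / len(task.rubrics)
        if all(passed):
            break
        failed = [r.description for r, ok in zip(task.rubrics, passed) if not ok]
        message = user_feedback(failed)                    # simulated user critique
    return score
```

The user-simulation agent replaces the human-in-the-loop feedback that the paper identifies as the scalability bottleneck, which is what makes automated rollout collection possible.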

πŸ“ Abstract
Large Language Model (LLM)-based autonomous agents demonstrate multifaceted capabilities that can contribute substantially to economic production. However, existing benchmarks remain focused on a single agentic capability, failing to capture long-horizon real-world scenarios. Moreover, the reliance on human-in-the-loop feedback for realistic tasks creates a scalability bottleneck, hindering automated rollout collection and evaluation. To bridge this gap, we introduce AgencyBench, a comprehensive benchmark derived from daily AI usage, evaluating 6 core agentic capabilities across 32 real-world scenarios, comprising 138 tasks with specific queries, deliverables, and rubrics. These scenarios require an average of 90 tool calls, 1 million tokens, and hours of execution time to resolve. To enable automated evaluation, we employ a user-simulation agent to provide iterative feedback, and a Docker sandbox to conduct visual and functional rubric-based assessment. Experiments reveal that closed-source models significantly outperform open-source models (48.4% vs. 32.1%). Further analysis reveals significant disparities across models in resource efficiency, feedback-driven self-correction, and specific tool-use preferences. Finally, we investigate the impact of agentic scaffolds, observing that proprietary models demonstrate superior performance within their native ecosystems (e.g., Claude-4.5-Opus via Claude-Agent-SDK), while open-source models exhibit distinct performance peaks, suggesting potential optimization for specific execution frameworks. AgencyBench serves as a critical testbed for next-generation agents, highlighting the necessity of co-optimizing model architecture with agentic frameworks. We believe this work sheds light on the future direction of autonomous agents, and we release the full benchmark and evaluation toolkit at https://github.com/GAIR-NLP/AgencyBench.
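The functional rubric checks run inside an isolated Docker sandbox. A minimal sketch of how such a sandboxed check might be invoked is below; the helper name `sandbox_check_cmd`, the image, and the flag choices are illustrative assumptions, not the benchmark's actual harness.

```python
def sandbox_check_cmd(image, workdir, check_script, timeout=300):
    """Build a `docker run` invocation that executes a functional rubric
    check inside an isolated container (hypothetical sketch)."""
    return [
        "docker", "run", "--rm",
        "--network", "none",               # no network access during grading
        "-v", f"{workdir}:/workspace:ro",  # mount deliverables read-only
        "-w", "/workspace",
        image,
        "timeout", str(timeout),           # bound wall-clock time per check
        "bash", "-c", check_script,
    ]

# Example: run a test suite against the agent's deliverables.
cmd = sandbox_check_cmd("python:3.11", "/tmp/agent_output", "pytest -q")
```

Isolation (read-only mounts, no network, a hard timeout) keeps the grading reproducible and prevents a misbehaving deliverable from interfering with the evaluation host.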
Problem

Research questions and friction points this paper is trying to address.

autonomous agents
benchmarking
real-world scenarios
long-horizon tasks
automated evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

autonomous agents
large language models
benchmarking
user simulation
Docker sandbox