AI Summary
This work addresses the limitations of existing agent benchmarks, which often focus on isolated capabilities and struggle to evaluate long-horizon, high-complexity real-world tasks, since their reliance on manual feedback hinders scalability. The authors propose the first comprehensive, automated benchmark tailored to everyday AI usage scenarios, encompassing 32 real-world settings and 138 tasks, each requiring an average of 90 tool invocations and processing over one million tokens. The framework employs user-simulation agents for iterative feedback, Docker-based sandboxing for visual and functional rubric validation, and a standardized task interface enabling unified closed-loop evaluation of both open- and closed-source models. Experimental results demonstrate a significant performance gap favoring closed-source models (48.4% vs. 32.1%) and highlight the critical role of co-optimizing models with agent frameworks to enhance overall effectiveness.
Abstract
Autonomous agents based on Large Language Models (LLMs) demonstrate multifaceted capabilities that can contribute substantially to economic production. However, existing benchmarks remain focused on single agentic capabilities, failing to capture long-horizon real-world scenarios. Moreover, the reliance on human-in-the-loop feedback for realistic tasks creates a scalability bottleneck, hindering automated rollout collection and evaluation. To bridge this gap, we introduce AgencyBench, a comprehensive benchmark derived from daily AI usage, evaluating 6 core agentic capabilities across 32 real-world scenarios, comprising 138 tasks with specific queries, deliverables, and rubrics. These scenarios require an average of 90 tool calls, 1 million tokens, and hours of execution time to resolve. To enable automated evaluation, we employ a user-simulation agent to provide iterative feedback, and a Docker sandbox to conduct visual and functional rubric-based assessment. Experiments reveal that closed-source models significantly outperform open-source models (48.4% vs. 32.1%). Further analysis reveals significant disparities across models in resource efficiency, feedback-driven self-correction, and specific tool-use preferences. Finally, we investigate the impact of agentic scaffolds, observing that proprietary models demonstrate superior performance within their native ecosystems (e.g., Claude-4.5-Opus via Claude-Agent-SDK), while open-source models exhibit distinct performance peaks, suggesting potential optimization for specific execution frameworks. AgencyBench serves as a critical testbed for next-generation agents, highlighting the necessity of co-optimizing model architecture with agentic frameworks. We believe this work sheds light on the future direction of autonomous agents, and we release the full benchmark and evaluation toolkit at https://github.com/GAIR-NLP/AgencyBench.
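To make the closed-loop evaluation protocol concrete, the sketch below illustrates the general shape of a user-simulation feedback loop with rubric-based scoring. This is not the paper's implementation: the `Task`, `run_agent`, and `simulate_user` names are hypothetical stand-ins (a real setup would call an LLM agent inside a Docker sandbox and an LLM-based user simulator), but the control flow (run, collect feedback on failed rubric items, retry, then score) mirrors the described pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Task:
    """A benchmark task: a query plus a rubric of named pass/fail checks."""
    query: str
    rubric: List[Tuple[str, Callable[[str], bool]]]
    max_rounds: int = 3  # feedback iterations before final scoring

def run_agent(query: str, feedback: List[str]) -> str:
    # Stand-in for a sandboxed LLM agent; here it just echoes the query
    # and any accumulated user feedback into its deliverable.
    return " | ".join([query] + feedback)

def simulate_user(deliverable: str, rubric) -> List[str]:
    # Stand-in for the user-simulation agent: it inspects the deliverable
    # and returns natural-language feedback for each failed rubric item.
    return [f"Please satisfy: {name}" for name, check in rubric
            if not check(deliverable)]

def evaluate(task: Task) -> float:
    """Closed-loop evaluation: iterate agent + simulated user, then score."""
    feedback: List[str] = []
    deliverable = ""
    for _ in range(task.max_rounds):
        deliverable = run_agent(task.query, feedback)
        missing = simulate_user(deliverable, task.rubric)
        if not missing:          # all rubric items satisfied
            break
        feedback.extend(missing) # feed failures back to the agent
    passed = sum(1 for _, check in task.rubric if check(deliverable))
    return passed / len(task.rubric)
```

For example, a task whose rubric requires the words "report" and "summary" in the deliverable is only partially satisfied in round one; the simulated user's feedback lets the stub agent recover in round two, yielding a full score.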