AI Summary
This work addresses the limitations of existing agent benchmarks, which often focus on isolated capabilities and struggle to evaluate long-horizon, high-complexity real-world tasks, since their reliance on manual feedback hinders scalability. The authors propose the first comprehensive, automated benchmark tailored to everyday AI usage scenarios, encompassing 32 real-world settings and 138 tasks, each requiring an average of 90 tool invocations and processing over one million tokens. The framework employs user-simulation agents for iterative feedback, Docker-based sandboxing for visual and functional rubric validation, and a standardized task interface enabling unified closed-loop evaluation of both open- and closed-source models. Experimental results demonstrate a significant performance gap favoring closed-source models (48.4% vs. 32.1%) and highlight the critical role of co-optimizing models with agent frameworks to enhance overall effectiveness.
Abstract
Autonomous agents based on Large Language Models (LLMs) demonstrate multifaceted capabilities that can contribute substantially to economic production. However, existing benchmarks remain focused on single agentic capabilities, failing to capture long-horizon real-world scenarios. Moreover, the reliance on human-in-the-loop feedback for realistic tasks creates a scalability bottleneck, hindering automated rollout collection and evaluation. To bridge this gap, we introduce AgencyBench, a comprehensive benchmark derived from daily AI usage, evaluating 6 core agentic capabilities across 32 real-world scenarios, comprising 138 tasks with specific queries, deliverables, and rubrics. These scenarios require an average of 90 tool calls, 1 million tokens, and hours of execution time to resolve. To enable automated evaluation, we employ a user-simulation agent to provide iterative feedback, and a Docker sandbox to conduct visual and functional rubric-based assessment. Experiments reveal that closed-source models significantly outperform open-source models (48.4% vs. 32.1%). Further analysis reveals significant disparities across models in resource efficiency, feedback-driven self-correction, and specific tool-use preferences. Finally, we investigate the impact of agentic scaffolds, observing that proprietary models demonstrate superior performance within their native ecosystems (e.g., Claude-4.5-Opus via Claude-Agent-SDK), while open-source models exhibit distinct performance peaks, suggesting potential optimization for specific execution frameworks. AgencyBench serves as a critical testbed for next-generation agents, highlighting the necessity of co-optimizing model architecture with agentic frameworks. We believe this work sheds light on the future direction of autonomous agents, and we release the full benchmark and evaluation toolkit at https://github.com/GAIR-NLP/AgencyBench.
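To make the closed-loop evaluation protocol concrete, the sketch below illustrates the general shape of a user-simulation feedback loop with rubric-based scoring. This is not the paper's implementation: the `Task`, `run_agent`, and `simulate_user` names are hypothetical stand-ins (a real setup would call an LLM agent inside a Docker sandbox and an LLM-based user simulator), but the control flow (run, collect feedback on failed rubric items, retry, then score) mirrors the described pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Task:
    """A benchmark task: a query plus a rubric of named pass/fail checks."""
    query: str
    rubric: List[Tuple[str, Callable[[str], bool]]]
    max_rounds: int = 3  # feedback iterations before final scoring

def run_agent(query: str, feedback: List[str]) -> str:
    # Stand-in for a sandboxed LLM agent; here it just echoes the query
    # and any accumulated user feedback into its deliverable.
    return " | ".join([query] + feedback)

def simulate_user(deliverable: str, rubric) -> List[str]:
    # Stand-in for the user-simulation agent: it inspects the deliverable
    # and returns natural-language feedback for each failed rubric item.
    return [f"Please satisfy: {name}" for name, check in rubric
            if not check(deliverable)]

def evaluate(task: Task) -> float:
    """Closed-loop evaluation: iterate agent + simulated user, then score."""
    feedback: List[str] = []
    deliverable = ""
    for _ in range(task.max_rounds):
        deliverable = run_agent(task.query, feedback)
        missing = simulate_user(deliverable, task.rubric)
        if not missing:          # all rubric items satisfied
            break
        feedback.extend(missing) # feed failures back to the agent
    passed = sum(1 for _, check in task.rubric if check(deliverable))
    return passed / len(task.rubric)
```

For example, a task whose rubric requires the words "report" and "summary" in the deliverable is only partially satisfied in round one; the simulated user's feedback lets the stub agent recover in round two, yielding a full score.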