🤖 AI Summary
Existing evaluation benchmarks are often confined to specific domains, making it difficult to comprehensively assess the general capabilities of large language model (LLM) agents in unified environments that demand multiple skills and tools. To address this limitation, this work proposes General AgentBench, the first unified evaluation framework for general-purpose LLM agents, and systematically examines their test-time scaling behavior across diverse tasks involving search, coding, reasoning, and tool use, under both sequential and parallel strategies. The framework establishes an end-to-end evaluation pipeline through multi-domain task integration, sequential interactive iteration, and multi-trajectory parallel sampling. Experiments reveal that mainstream LLM agents suffer significant performance degradation when transitioning from specialized to general settings, and that current test-time scaling approaches are hindered by context-length constraints in sequential methods and verification gaps in parallel ones, limiting their effectiveness in practice.
📝 Abstract
LLM agents are increasingly expected to function as general-purpose systems capable of resolving open-ended user requests. While existing benchmarks focus on domain-specific environments for developing specialized agents, evaluating general-purpose agents requires more realistic settings that challenge them to operate across multiple skills and tools within a unified environment. We introduce General AgentBench, a benchmark that provides such a unified framework for evaluating general LLM agents across search, coding, reasoning, and tool-use domains. Using General AgentBench, we systematically study test-time scaling behaviors under sequential scaling (iterative interaction) and parallel scaling (sampling multiple trajectories). Evaluation of ten leading LLM agents reveals substantial performance degradation when moving from domain-specific evaluations to this general-agent setting. Moreover, we find that neither scaling methodology yields effective performance improvements in practice, due to two fundamental limitations: a context ceiling in sequential scaling and a verification gap in parallel scaling. Code is publicly available at https://github.com/cxcscmu/General-AgentBench.
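The two scaling strategies contrasted above can be illustrated with a minimal sketch. Everything here is a toy assumption for exposition, not the General AgentBench implementation: `agent_step` stands in for an LLM agent call, and `verify` for an outcome verifier. Sequential scaling feeds each attempt back into a growing history (which in real agents eventually exceeds the model's context window, the "context ceiling"); parallel scaling samples independent trajectories and relies on the verifier to pick a winner (the "verification gap").

```python
import random

# Toy stand-ins -- illustrative assumptions only, not the benchmark's API.
def agent_step(task, history):
    """One agent attempt; here just a random guess at the answer."""
    return random.randint(0, 9)

def verify(task, answer):
    """A verifier scoring a candidate answer (1.0 = correct)."""
    return 1.0 if answer == task["target"] else 0.0

def sequential_scaling(task, budget):
    """Iterative interaction: each round appends to a shared history.
    In practice the growing history is what hits the context ceiling."""
    history = []
    for _ in range(budget):
        answer = agent_step(task, history)
        history.append(answer)          # context grows with every round
        if verify(task, answer) == 1.0:  # stop early on success
            return answer
    return history[-1]

def parallel_scaling(task, budget):
    """Best-of-N: sample independent trajectories, keep the top-scoring one.
    Effectiveness hinges entirely on the verifier's reliability."""
    candidates = [agent_step(task, []) for _ in range(budget)]
    return max(candidates, key=lambda a: verify(task, a))

if __name__ == "__main__":
    random.seed(0)
    task = {"target": 7}
    print(sequential_scaling(task, budget=16))
    print(parallel_scaling(task, budget=16))
```

With a perfect verifier, both strategies converge on the target given enough budget; the paper's finding is that real agents lack both the context headroom (sequential) and a reliable verifier (parallel) for this to pay off.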