🤖 AI Summary
Existing evaluation benchmarks are often confined to specific domains, making it difficult to comprehensively assess the general capabilities of large language model (LLM) agents in unified environments that demand multiple skills and tools. To address this limitation, this work proposes General AgentBench, the first unified evaluation framework for general-purpose LLM agents, and systematically examines their test-time scaling behavior across diverse tasks involving search, coding, reasoning, and tool use, under both sequential and parallel strategies. The framework establishes an end-to-end evaluation pipeline through multi-domain task integration, sequential interactive iteration, and multi-trajectory parallel sampling. Experiments reveal that mainstream LLM agents suffer significant performance degradation when transitioning from specialized to general settings, and that current test-time scaling approaches are hindered by context-length constraints in sequential methods and verification gaps in parallel ones, limiting their effectiveness in practice.
📝 Abstract
LLM agents are increasingly expected to function as general-purpose systems capable of resolving open-ended user requests. While existing benchmarks focus on domain-specific environments for developing specialized agents, evaluating general-purpose agents requires more realistic settings that challenge them to operate across multiple skills and tools within a unified environment. We introduce General AgentBench, a benchmark that provides such a unified framework for evaluating general LLM agents across search, coding, reasoning, and tool-use domains. Using General AgentBench, we systematically study test-time scaling behaviors under sequential scaling (iterative interaction) and parallel scaling (sampling multiple trajectories). Evaluation of ten leading LLM agents reveals substantial performance degradation when moving from domain-specific evaluations to this general-agent setting. Moreover, we find that neither scaling methodology yields effective performance improvements in practice, due to two fundamental limitations: a context ceiling in sequential scaling and a verification gap in parallel scaling. Code is publicly available at https://github.com/cxcscmu/General-AgentBench.
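The two scaling strategies contrasted above can be illustrated with a minimal sketch. Everything here is a toy assumption for exposition, not the General AgentBench implementation: `agent_step` stands in for an LLM agent call, and `verify` for an outcome verifier. Sequential scaling feeds each attempt back into a growing history (which in real agents eventually exceeds the model's context window, the "context ceiling"); parallel scaling samples independent trajectories and relies on the verifier to pick a winner (the "verification gap").

```python
import random

# Toy stand-ins -- illustrative assumptions only, not the benchmark's API.
def agent_step(task, history):
    """One agent attempt; here just a random guess at the answer."""
    return random.randint(0, 9)

def verify(task, answer):
    """A verifier scoring a candidate answer (1.0 = correct)."""
    return 1.0 if answer == task["target"] else 0.0

def sequential_scaling(task, budget):
    """Iterative interaction: each round appends to a shared history.
    In practice the growing history is what hits the context ceiling."""
    history = []
    for _ in range(budget):
        answer = agent_step(task, history)
        history.append(answer)          # context grows with every round
        if verify(task, answer) == 1.0:  # stop early on success
            return answer
    return history[-1]

def parallel_scaling(task, budget):
    """Best-of-N: sample independent trajectories, keep the top-scoring one.
    Effectiveness hinges entirely on the verifier's reliability."""
    candidates = [agent_step(task, []) for _ in range(budget)]
    return max(candidates, key=lambda a: verify(task, a))

if __name__ == "__main__":
    random.seed(0)
    task = {"target": 7}
    print(sequential_scaling(task, budget=16))
    print(parallel_scaling(task, budget=16))
```

With a perfect verifier, both strategies converge on the target given enough budget; the paper's finding is that real agents lack both the context headroom (sequential) and a reliable verifier (parallel) for this to pay off.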