Benchmark Test-Time Scaling of General LLM Agents

📅 2026-02-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluation benchmarks are often confined to specific domains, making it difficult to comprehensively assess the general capabilities of large language model (LLM) agents in unified environments that require multiple skills and tools. To address this limitation, this work proposes General AgentBench, the first unified evaluation framework for general-purpose LLM agents, and systematically examines their test-time scaling behavior across diverse tasks involving search, coding, reasoning, and tool use, under both sequential and parallel strategies. The framework establishes an end-to-end evaluation pipeline through multi-domain task integration, sequential interactive iteration, and multi-trajectory parallel sampling. Experiments reveal that mainstream LLM agents suffer significant performance degradation when moving from specialized to general settings, and that current test-time scaling approaches are hindered by context-length ceilings in sequential methods and verification gaps in parallel ones, limiting their effectiveness in practice.

📝 Abstract
LLM agents are increasingly expected to function as general-purpose systems capable of resolving open-ended user requests. While existing benchmarks focus on domain-aware environments for developing specialized agents, evaluating general-purpose agents requires more realistic settings that challenge them to operate across multiple skills and tools within a unified environment. We introduce General AgentBench, a benchmark that provides such a unified framework for evaluating general LLM agents across search, coding, reasoning, and tool-use domains. Using General AgentBench, we systematically study test-time scaling behaviors under sequential scaling (iterative interaction) and parallel scaling (sampling multiple trajectories). Evaluation of ten leading LLM agents reveals a substantial performance degradation when moving from domain-specific evaluations to this general-agent setting. Moreover, we find that neither scaling methodology yields effective performance improvements in practice, due to two fundamental limitations: context ceiling in sequential scaling and verification gap in parallel scaling. Code is publicly available at https://github.com/cxcscmu/General-AgentBench.
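The two scaling strategies the abstract contrasts, and the failure mode attributed to each, can be illustrated with a minimal sketch. Everything here is hypothetical: `run_agent` is a toy stand-in for one agent rollout (the real benchmark drives an LLM with tools), and `context_limit` is an arbitrary number standing in for the model's context window.

```python
import collections

def run_agent(task, seed):
    # Toy deterministic stand-in for one agent rollout; different seeds
    # mimic different sampled trajectories.
    return f"answer-{seed % 3}"

def sequential_scaling(task, max_turns, context_limit=8):
    """Sequential scaling: keep iterating within one trajectory. The growing
    interaction history eventually hits the context ceiling, after which no
    further turns are possible regardless of the compute budget."""
    history = []
    for turn in range(max_turns):
        if len(history) >= context_limit:  # the "context ceiling"
            break
        history.append(run_agent(task, turn))
    return history[-1], len(history)

def parallel_scaling(task, n_samples):
    """Parallel scaling: sample n independent trajectories, then select one.
    Without a reliable verifier to score candidates, majority vote is the
    usual fallback, which is where the "verification gap" bites: the most
    frequent answer need not be the correct one."""
    answers = [run_agent(task, seed) for seed in range(n_samples)]
    return collections.Counter(answers).most_common(1)[0][0]
```

In this toy setup, `sequential_scaling("t", 20)` stops after 8 turns even though 20 were budgeted, and `parallel_scaling("t", 10)` returns whichever answer happens to be most frequent, with no guarantee of correctness.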
Problem

Research questions and friction points this paper is trying to address:

general LLM agents, benchmark, test-time scaling, tool-use, open-ended tasks
Innovation

Methods, ideas, or system contributions that make the work stand out:

General AgentBench, test-time scaling, LLM agents, sequential scaling, parallel scaling
Authors

Xiaochuan Li, Carnegie Mellon University (Machine Learning, Natural Language Processing)
Ryan Ming, Language Technologies Institute, School of Computer Science, Carnegie Mellon University
Pranav Setlur, Language Technologies Institute, School of Computer Science, Carnegie Mellon University
Abhijay Paladugu, Language Technologies Institute, School of Computer Science, Carnegie Mellon University
Andy Tang, Language Technologies Institute, School of Computer Science, Carnegie Mellon University
Hao Kang, Carnegie Mellon University
Shuai Shao, Meta
Rong Jin, Meta
Chenyan Xiong, Associate Professor, Carnegie Mellon University (Information Retrieval, Language Models, Natural Language Understanding)