Scaling Test-time Compute for LLM Agents

📅 2025-06-15
🤖 AI Summary
Large language model (LLM) agents often exhibit insufficient reasoning depth and poor decision robustness on complex tasks. To address this, the authors propose a "timely reflection" mechanism that integrates parallel sampling, sequential revision, multi-level verification, and list-wise result fusion, augmented by a rollout diversity control strategy. The work presents the first scalable, structured test-time compute expansion for LLM agents and empirically establishes a stable positive correlation between computational budget and performance: list-wise fusion consistently outperforms alternative aggregation methods, and diverse trajectory generation yields consistent gains. Evaluated across multiple reasoning and tool-use benchmarks, the approach delivers scalable, robust, and fine-tuning-free improvements, demonstrating that principled test-time computation expansion is a viable pathway to enhancing agent intelligence.

📝 Abstract
Scaling test-time compute has shown remarkable success in improving the reasoning abilities of large language models (LLMs). In this work, we conduct the first systematic exploration of applying test-time scaling methods to language agents and investigate the extent to which it improves their effectiveness. Specifically, we explore different test-time scaling strategies, including: (1) parallel sampling algorithms; (2) sequential revision strategies; (3) verifiers and merging methods; (4) strategies for diversifying rollouts. We carefully analyze and ablate the impact of different design strategies on applying test-time scaling to language agents, and arrive at the following findings: 1. Scaling test-time compute can improve agent performance. 2. Knowing when to reflect is important for agents. 3. Among the verification and result-merging approaches, the list-wise method performs best. 4. Increasing the diversity of rollouts has a positive effect on the agent's task performance.
Problem

Research questions and friction points this paper is trying to address.

Exploring test-time scaling to enhance LLM agents' reasoning
Evaluating strategies like parallel sampling and sequential revision
Analyzing verification methods and rollout diversity for agent performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Parallel sampling algorithms for test-time scaling
Sequential revision strategies for LLM agents
List-wise verification and merging methods
King Zhu
OPPO AI Agent Team
Hanhao Li
The Chinese University of Hong Kong
Siwei Wu
University of Manchester
Large Language Models, Natural Language Processing, Commonsense Reasoning
Tianshun Xing
OPPO AI Agent Team
Dehua Ma
OPPO AI Agent Team
Xiangru Tang
OPPO AI Agent Team
Minghao Liu
OPPO AI Agent Team
Jian Yang
OPPO AI Agent Team
Jiaheng Liu
OPPO AI Agent Team
Yuchen Eleanor Jiang
OPPO
natural language processing, machine learning
Changwang Zhang
OPPO AI Agent Team
Chenghua Lin
Professor of Natural Language Processing, University of Manchester
Natural language processing, natural language generation, machine learning
Jun Wang
OPPO AI Agent Team
Ge Zhang
OPPO AI Agent Team
Wangchunshu Zhou
OPPO & M-A-P
artificial general intelligence, language agents, large language models, natural language processing