Tree Search for LLM Agent Reinforcement Learning

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address sparse outcome rewards and insufficient supervision signals in long-horizon, multi-step agent tasks, this paper proposes Tree-GRPO, a tree-search-based grouped reinforcement learning framework. It constructs a trajectory tree with shared prefixes to enable efficient sampling and fine-grained process-level supervision. By integrating intra-tree and inter-tree grouped relative policy optimization, Tree-GRPO implicitly decomposes sparse outcome rewards into step-level preference signals, which is provably equivalent to direct step-level preference learning. Coupled with process reward decomposition and multi-level advantage estimation, the framework significantly improves training stability and generalization. Extensive evaluation across 11 datasets spanning three question-answering task categories demonstrates consistent gains over chain-based RL baselines, validating both the effectiveness and generality of tree-structured modeling for agent tasks.

📝 Abstract
Recent advances in reinforcement learning (RL) have significantly enhanced the agentic capabilities of large language models (LLMs). In long-term and multi-turn agent tasks, existing approaches driven solely by outcome rewards often suffer from the problem of sparse supervision. To address this challenge, we propose Tree-based Group Relative Policy Optimization (Tree-GRPO), a grouped agent RL method based on tree search, where each tree node represents a complete agent interaction step. By sharing common prefixes, tree-search sampling increases the number of rollouts achievable within a fixed budget of tokens or tool calls. Moreover, we find that the tree-structured trajectory naturally allows the construction of step-wise process supervision signals using only the outcome reward. Based on this, Tree-GRPO estimates grouped relative advantages at both the intra-tree and inter-tree levels. Through theoretical analysis, we demonstrate that the objective of intra-tree group relative policy optimization is equivalent to that of step-level direct preference learning. Experiments across 11 datasets and 3 types of QA tasks demonstrate the superiority of the proposed tree-based RL over chain-based RL methods.
Problem

Research questions and friction points this paper is trying to address.

Addresses sparse supervision in long-term multi-turn agent tasks
Proposes tree search method to increase rollouts within token budget
Enables step-wise process supervision using only outcome rewards
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tree-based grouped agent reinforcement learning method
Tree search sampling with shared common prefixes
Intra-tree and inter-tree grouped relative advantages optimization
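The two-level advantage estimation above can be sketched as follows. This is a minimal illustrative reading, assuming each tree is represented simply as the list of outcome rewards of its leaf trajectories (which share a common prefix), that each level applies GRPO-style group normalization, and that the two levels are combined by a weighted sum; the function names, data layout, and `alpha` weighting are assumptions, not the authors' code.

```python
# Hypothetical sketch of Tree-GRPO's intra-tree / inter-tree grouped
# relative advantages; structure and weighting are illustrative only.
from statistics import mean, pstdev

def grouped_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each reward within its group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def tree_grpo_advantages(trees, alpha=0.5):
    """Combine intra-tree and inter-tree grouped relative advantages.

    trees: list of trees; each tree is a list of leaf outcome rewards
           for trajectories sharing a common prefix.
    alpha: assumed weight between the intra- and inter-tree terms.
    """
    # Intra-tree: normalize each leaf against its siblings in the same
    # tree, turning sparse outcome rewards into relative step signals.
    intra = [grouped_advantages(tree) for tree in trees]

    # Inter-tree: normalize every leaf against all leaves of all trees.
    flat = [r for tree in trees for r in tree]
    inter_flat = grouped_advantages(flat)

    # Re-nest the inter-tree advantages to mirror the tree layout.
    inter, i = [], 0
    for tree in trees:
        inter.append(inter_flat[i:i + len(tree)])
        i += len(tree)

    # Weighted sum of both levels, one advantage per leaf trajectory.
    return [[alpha * a + (1 - alpha) * b for a, b in zip(ia, ib)]
            for ia, ib in zip(intra, inter)]
```

Note that leaves in a tree where every rollout earns the same reward get zero intra-tree advantage, so in this sketch the inter-tree term is what still lets such trees contribute a learning signal.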
Yuxiang Ji
Xiamen University
Ziyu Ma
AMAP, Alibaba Group
Yong Wang
AMAP, Alibaba Group
Guanhua Chen
Southern University of Science and Technology
Xiangxiang Chu
AMAP, Alibaba Group
Liaoni Wu
Xiamen University