TreeAdv: Tree-Structured Advantage Redistribution for Group-Based RL

📅 2026-01-07
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the inefficiency of existing group-based reinforcement learning methods, which flatten trajectories into independent sequences and consequently suffer from low sample efficiency and a bias toward overly long reasoning chains. To overcome this, the authors propose a tree-structured group rollout mechanism that explicitly models the hierarchical structure of trajectories: branching exploration occurs at high-uncertainty nodes, while low-uncertainty nodes share common prefixes, constructing an efficient rollout forest. The method introduces tree-based advantage redistribution, propagating advantages from leaf nodes back to internal tokens, and integrates entropy-driven sampling, making it compatible with policy-optimization objectives such as GRPO and GSPO. Evaluated on ten mathematical reasoning benchmarks under identical supervision, data, and decoding budgets, the approach consistently outperforms baseline methods while generating fewer tokens.
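
The branching rule described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `next_dist`, `tau` (the entropy threshold), and `width` (the fork factor) are assumed names, and the "policy" is any callable returning a next-token distribution.

```python
import math
import random

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def rollout_tree(next_dist, prefix, depth, max_depth, tau=0.5, width=2):
    """Grow a rollout tree: fork only where the policy is uncertain.

    next_dist(prefix) -> list of (token, prob) pairs.  At each step we
    sample `width` continuations if the entropy exceeds `tau` (branching
    exploration), otherwise a single continuation (shared prefix).
    Returns the tree as nested dicts {token: subtree}, with {} at leaves.
    """
    if depth == max_depth:
        return {}
    dist = next_dist(prefix)
    k = width if entropy([p for _, p in dist]) > tau else 1
    tokens = random.choices([t for t, _ in dist],
                            weights=[p for _, p in dist], k=k)
    return {t: rollout_tree(next_dist, prefix + [t], depth + 1,
                            max_depth, tau, width)
            for t in set(tokens)}
```

A confident (low-entropy) policy yields a single shared chain, while an uncertain one fans out into a tree, which is how the rollout forest stays cheap on easy tokens.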

📝 Abstract
Reinforcement learning with group-based objectives, such as Group Relative Policy Optimization (GRPO), is a common framework for aligning large language models on complex reasoning tasks. However, standard GRPO treats each rollout trajectory as an independent flat sequence and assigns a single sequence-level advantage to all tokens, which leads to sample inefficiency and a length bias toward verbose, redundant chains of thought without improving logical depth. We introduce TreeAdv (Tree-Structured Advantage Redistribution for Group-Based RL), which makes the tree structure of group rollouts explicit for both exploration and advantage assignment. Specifically, TreeAdv builds a group of trees (a forest) based on an entropy-driven sampling method where each tree branches at high-uncertainty decisions while sharing low-uncertainty tokens across rollouts. Then, TreeAdv aggregates token-level advantages for internal tree segments by redistributing the advantages of complete rollouts (all leaf nodes), and TreeAdv can easily apply to group-based objectives such as GRPO or GSPO. Across 10 math reasoning benchmarks, TreeAdv consistently outperforms GRPO and GSPO, while using substantially fewer generated tokens under identical supervision, data, and decoding budgets.
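
The redistribution step in the abstract can be sketched as follows. This is an assumed toy formulation, not the paper's code: leaf advantages are GRPO-style group-standardized rewards, and each internal segment receives the mean advantage of the complete rollouts (leaves) that pass through it.

```python
def group_advantages(rewards):
    """GRPO-style sequence advantages: rewards standardized within the group."""
    mu = sum(rewards) / len(rewards)
    std = (sum((r - mu) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    return [(r - mu) / std for r in rewards]

def redistribute(node):
    """Propagate leaf advantages back to internal tree segments.

    A node is either a float (a finished rollout's advantage) or a dict of
    children.  Each internal segment is assigned the mean advantage over
    all leaves sharing it as a prefix.
    Returns (annotated_tree, mean_advantage, n_leaves).
    """
    if not isinstance(node, dict):          # leaf: advantage stays as-is
        return node, node, 1
    total, count, annotated = 0.0, 0, {}
    for tok, child in node.items():
        sub, adv, n = redistribute(child)
        annotated[tok] = sub
        total += adv * n                    # weight by descendant leaf count
        count += n
    mean = total / count
    return {"adv": mean, "children": annotated}, mean, count
```

Under this scheme a shared prefix whose continuations succeed and fail in equal measure nets out to roughly zero advantage, so credit concentrates on the branches that actually discriminate good from bad reasoning.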
Problem

Research questions and friction points this paper is trying to address.

group-based reinforcement learning
advantage assignment
sample inefficiency
length bias
reasoning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tree-Structured Advantage Redistribution
Group-Based Reinforcement Learning
Entropy-Driven Sampling
Token-Level Advantage Aggregation
Rollout Forest
Lang Cao
CS PhD Student at University of Illinois Urbana-Champaign
Machine Learning · Machine Reasoning · AI for Health

Hui Ruan
Huawei Technologies Co., Ltd., China

Yongqian Li
Unknown affiliation

Peng Chao
Huawei Technologies Co., Ltd., China

Wu Ning
Huawei Technologies Co., Ltd., China

Haonan Song
Huawei Technologies Co., Ltd., China

Renhong Chen
Huawei Technologies Co., Ltd., China

Yitong Li
Huawei Technologies Co., Ltd.
Natural Language Processing · Machine Learning