AI Summary
To address the high computational cost of on-policy rollouts and the limited exploration of reasoning paths in reinforcement learning-based alignment of large language models (LLMs), this paper proposes TreePO: a tree-structured search framework for sequence generation that integrates dynamic tree sampling with fixed-length segment decoding, using local uncertainty to guide branch expansion. Key innovations include (i) segmented sampling with early termination to alleviate KV cache pressure; (ii) tree-structured segment-wise advantage estimation, unifying global policy optimization and local credit assignment; and (iii) a dynamic divergence-and-backtracking strategy grounded in both token-level probability and output quality. On multiple reasoning benchmarks, TreePO reduces GPU time during sampling by 22%-43%, decreases trajectory-level computation by 40%, and lowers token-level computation by 35%, significantly improving training efficiency and scalability.
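The core loop described above (fixed-length segment decoding, with extra branches spawned where local uncertainty is high) can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: `generate_segment` is a toy stand-in for LLM decoding, and the entropy threshold, segment length, and branch factor are assumed values chosen for the example.

```python
import random

SEG_LEN = 4              # fixed-length decoding segment (illustrative value)
MAX_SEGMENTS = 3         # rollout depth, in segments
BRANCH_FACTOR = 2        # children spawned at a high-uncertainty node
ENTROPY_THRESHOLD = 0.5  # hypothetical branching trigger

def generate_segment(prefix, rng):
    """Toy stand-in for LLM decoding: returns (segment_tokens, mean_entropy).

    A real implementation would decode SEG_LEN tokens while reusing the KV
    cache of the shared prefix; here we emit random token ids and a random
    value standing in for mean token entropy."""
    segment = [rng.randrange(100) for _ in range(SEG_LEN)]
    mean_entropy = rng.random()
    return segment, mean_entropy

def tree_rollout(seed=0):
    """Segment-wise tree search: extend each prefix by one segment per step,
    and branch additionally wherever local uncertainty exceeds the threshold."""
    rng = random.Random(seed)
    frontier = [[]]  # token prefixes; common prefixes are amortized in practice
    for _ in range(MAX_SEGMENTS):
        next_frontier = []
        for prefix in frontier:
            segment, entropy = generate_segment(prefix, rng)
            next_frontier.append(prefix + segment)
            if entropy > ENTROPY_THRESHOLD:
                # high local uncertainty: spawn alternative continuations
                for _ in range(BRANCH_FACTOR - 1):
                    alt_segment, _ = generate_segment(prefix, rng)
                    next_frontier.append(prefix + alt_segment)
        frontier = next_frontier
    return frontier

trajectories = tree_rollout()
```

In a real system the low-value branches would also be pruned early (the early-termination mechanism above), so the frontier stays bounded rather than growing at every uncertain node.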
Abstract
Recent advances in aligning large language models via reinforcement learning have achieved remarkable gains on complex reasoning problems, but at the cost of expensive on-policy rollouts and limited exploration of diverse reasoning paths. In this work, we introduce TreePO, a self-guided rollout algorithm that views sequence generation as a tree-structured search process. Composed of a dynamic tree-sampling policy and fixed-length segment decoding, TreePO leverages local uncertainty to warrant additional branches. By amortizing computation across common prefixes and pruning low-value paths early, TreePO substantially reduces the per-update compute burden while preserving or enhancing exploration diversity. Key contributions include: (1) a segment-wise sampling algorithm that alleviates the KV cache burden through contiguous segments and spawns new branches along with an early-stop mechanism; (2) a tree-based segment-level advantage estimation that considers both global and local proximal policy optimization; and (3) an analysis of the effectiveness of probability- and quality-driven dynamic divergence and fallback strategies. We empirically validate TreePO's performance gains on a set of reasoning benchmarks, with the sampling design saving 22% up to 43% of GPU hours for the trained models, while showing up to a 40% reduction in trajectory-level and 35% in token-level sampling compute for existing models. While offering a free lunch in inference efficiency, TreePO reveals a practical path toward scaling RL-based post-training with fewer samples and less compute. The project home page is at https://m-a-p.ai/TreePO.
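Contribution (2) above mixes a global baseline with local credit assignment. One plausible way to read this, shown purely as an illustrative sketch rather than the paper's exact estimator, is to blend two baselines for a segment's value: a global one over all rollouts in the group, and a local one over sibling branches that share the segment's prefix in the tree. The mixing weight `alpha` is an assumed parameter introduced only for this example.

```python
import statistics

def segment_advantage(node_value, sibling_values, all_values, alpha=0.5):
    """Hypothetical tree-style segment advantage.

    node_value:     scalar value/reward credited to this segment's branch
    sibling_values: values of branches sharing this segment's prefix (local)
    all_values:     values of all rollouts in the sampling group (global)
    alpha:          assumed global/local mixing weight, not from the paper
    """
    global_baseline = statistics.mean(all_values)   # group-level baseline
    local_baseline = statistics.mean(sibling_values)  # sibling-level baseline
    return alpha * (node_value - global_baseline) + \
        (1 - alpha) * (node_value - local_baseline)
```

For example, a branch scoring 1.0 whose sibling set averages 0.5 and whose group also averages 0.5 receives an advantage of 0.5 under `alpha=0.5`; the local term rewards beating the alternatives from the same prefix, while the global term keeps the update comparable across the whole group.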