🤖 AI Summary
In the Reinforcement Learning with Verifiable Rewards (RLVR) framework, insufficient trajectory diversity within group reasoning rollouts attenuates the reward signal and makes policy learning inefficient. To address this, we propose an uncertainty-aware lookahead tree for rollout generation: at high-uncertainty decision points, the method actively branches to simulate multiple candidate trajectories, then applies similarity-driven pruning to suppress redundancy, explicitly enhancing trajectory-level diversity. Unlike conventional stochastic sampling, this approach avoids collapse into near-identical reasoning paths and markedly improves exploration quality. Experiments show an average 131% acceleration in policy learning and a 4.2% absolute improvement in final pass@1 performance, with consistent gains across diverse reasoning tasks. The core innovation lies in integrating a controllable tree structure into the rollout process, enabling efficient and diverse reasoning trajectory generation.
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR), particularly with algorithms like Group Relative Policy Optimization (GRPO), has proven highly effective in enhancing the reasoning capabilities of large language models. However, a critical bottleneck in current pipelines lies in the limited diversity of sampled trajectories during group rollouts. Homogeneous trajectories and their associated rewards diminish the return signals for policy updates, thereby hindering effective policy learning. This lack of diversity stems primarily from token-level stochastic sampling, where local variations are likely to collapse into near-identical reasoning paths. To address this limitation, we propose Lookahead Tree-Based Rollouts (LATR), a novel rollout strategy designed to explicitly promote trajectory-level diversity by enforcing branching into different candidate tokens likely to yield distinct continuations. Specifically, LATR iteratively operates in three stages: (1) branching at high-uncertainty generation steps, (2) performing lookahead simulation for each new branch, and (3) pruning branches that exhibit prolonged similarity during simulation. Compared with stochastic sampling, LATR accelerates policy learning by 131% on average and improves final pass@1 performance by 4.2% on both GRPO and Dynamic sAmpling Policy Optimization (DAPO) algorithms across different reasoning tasks. Our code and data are publicly available at https://github.com/starreeze/latr.
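The three-stage loop above can be sketched in Python. This is a toy illustration, not the authors' implementation: the entropy threshold, branch factor, similarity cutoff, and `toy_step` model are all hypothetical, and the lookahead simulation of stage (2) is abbreviated to comparing the partial trajectories themselves rather than separately simulated continuations.

```python
import math
from difflib import SequenceMatcher

# Hypothetical hyperparameters (illustrative values, not from the paper).
ENTROPY_THRESHOLD = 0.5  # uncertainty cutoff in nats for branching
BRANCH_FACTOR = 2        # candidate tokens expanded at a branch point
SIM_THRESHOLD = 0.9      # branches more similar than this are pruned

def entropy(probs):
    """Shannon entropy of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def similarity(a, b):
    """Character-level similarity between two partial trajectories."""
    return SequenceMatcher(None, a, b).ratio()

def latr_rollout(generate_step, prefix, max_steps):
    """Sketch of the LATR loop:
    (1) branch at high-uncertainty generation steps,
    (2) extend each branch (lookahead simulation, abbreviated here),
    (3) prune branches that remain too similar to an already-kept one.
    `generate_step(prefix)` returns (token, prob) candidates, best first.
    """
    branches = [prefix]
    for _ in range(max_steps):
        extended = []
        for b in branches:
            candidates = generate_step(b)
            probs = [p for _, p in candidates]
            if entropy(probs) > ENTROPY_THRESHOLD:
                # (1) high uncertainty: fork into distinct candidate tokens
                extended.extend(b + tok for tok, _ in candidates[:BRANCH_FACTOR])
            else:
                # low uncertainty: greedy continuation, no branching
                extended.append(b + candidates[0][0])
        # (3) keep a branch only if it stays dissimilar to all kept branches
        kept = []
        for b in extended:
            if all(similarity(b, k) < SIM_THRESHOLD for k in kept):
                kept.append(b)
        branches = kept
    return branches

# Toy next-token model (hypothetical): uncertain after 'a', near-greedy after.
def toy_step(prefix):
    if prefix.endswith("a"):
        return [("b", 0.5), ("c", 0.5)]
    return [("x", 0.99), ("y", 0.01)]
```

Running `latr_rollout(toy_step, "a", 2)` forks once at the uncertain first step and then continues each branch greedily, so two dissimilar trajectories survive pruning.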