Where Rollouts Begin: Low-Load, High-Leverage First-Token Diversification for RLVR

📅 2026-05-27
🤖 AI Summary
This work addresses the limited diversity of reasoning paths in reinforcement learning with verifiable rewards (RLVR), which constrains performance gains. The authors propose a lightweight, high-leverage strategy that targets the first token after the reasoning start marker: their method, REFT, samples that token uniformly from the policy's top-N candidates and allocates rollouts evenly. This broadens the regions a rollout group covers without modifying the correctness signal or any other RLVR component. REFT improves aggregate Pass@1, Pass@8, and Pass@64 over DAPO and GRPO baselines across four base models (0.5B-7B) and three difficulty regimes.
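For readers unfamiliar with the Pass@k metrics named above, the following is a minimal sketch of one commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n is the number of sampled completions and c the number judged correct. The source does not say how its Pass@k values are computed, so this estimator, the function name `pass_at_k`, and its parameters are assumptions for illustration only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k from n samples of which c are correct.

    Assumption: the widely used combinatorial estimator; the source
    does not state which estimator it uses.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that a random size-k subset contains no correct sample,
    # subtracted from 1. comb(n - c, k) is 0 when fewer than k samples
    # are incorrect, which gives a Pass@k of exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 the estimator reduces to the fraction of correct samples, c / n.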
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) trains reasoning models without labeled trajectories, relying on grouped rollouts to expose the policy to alternative reasoning paths and a verifier to score them. Rollout diversity has accordingly emerged as a central bottleneck in RLVR, with most existing methods broadening exploration through temperature, prefix, or rollout-selection adjustments. We identify a structurally distinguished but overlooked position for broadening this diversity: the first token after the reasoning marker. The policy's first-token distribution is sharply peaked yet decoupled from correctness, so diversifying this position can broaden the regions a rollout group covers without altering the correctness signal. We introduce REFT (Rollout Exploration with First-Token Diversification), a light addition to the RLVR pipeline that samples first tokens uniformly from the policy's own top-$N$ candidates and allocates rollouts evenly, leaving every other component unchanged. Trained on the resulting diversified rollouts, REFT improves aggregate Pass@1, Pass@8, and Pass@64 over DAPO and GRPO baselines across four base models (0.5B-7B) and three difficulty regimes.
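The first-token step described in the abstract can be sketched roughly as follows. This is an illustrative reading, not the authors' implementation: the function names, the dictionary of first-token log-probabilities, and the handling of a rollout budget that does not divide evenly by N (the leftover rollouts go to candidates drawn uniformly at random) are all assumptions.

```python
import random


def top_n_tokens(logprobs: dict, n: int) -> list:
    """Return the n tokens with the highest first-token log-probability."""
    return sorted(logprobs, key=logprobs.get, reverse=True)[:n]


def allocate_first_tokens(logprobs: dict, n: int, group_size: int, rng=None) -> list:
    """Plan the forced first token for each rollout in a group.

    Assumed interpretation: the policy's probabilities are used only to
    pick the top-n candidates; within that set every candidate is treated
    equally and the rollout budget is split as evenly as possible.
    """
    rng = rng or random.Random()
    candidates = top_n_tokens(logprobs, n)
    base, extra = divmod(group_size, len(candidates))
    # Leftover rollouts go to candidates chosen uniformly at random (assumption).
    lucky = set(rng.sample(candidates, extra))
    plan = []
    for tok in candidates:
        plan.extend([tok] * (base + (1 if tok in lucky else 0)))
    return plan
```

Each entry of the returned plan would be prepended after the reasoning marker before the policy continues generating as usual, so the verifier and the rest of the pipeline stay untouched.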
Problem

Research questions and friction points this paper is trying to address.

rollout diversity
first-token diversification
Reinforcement Learning with Verifiable Rewards
reasoning models
RLVR
Innovation

Methods, ideas, or system contributions that make the work stand out.

first-token diversification
rollout diversity
RLVR
REFT
reasoning models