🤖 AI Summary
Existing RLVR methods optimize all generated tokens uniformly, overlooking the critical role of prefix tokens in reasoning; this wastes computation on low-return tokens and constrains the improvement attainable from high-return ones. This paper proposes Progressive Prefix-token Policy Optimization (PPPO), a framework built on an identified "Beginning Lock-in Effect," in which the early portion of a reasoning trace substantially constrains its continuation. PPPO introduces two training strategies, Progressive Prefix Retention and Continuation Accumulated Reward, that focus policy optimization on how the model begins its reasoning, combining path-dependence-inspired modeling, Monte Carlo continuation sampling, and policy gradients. Evaluated across diverse reasoning tasks, it substantially outperforms representative RLVR approaches, achieving an 18.02% accuracy improvement while requiring only 26.17% of the training tokens used by baseline methods, demonstrating gains in both training efficiency and reasoning quality.
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) significantly enhances the reasoning capability of Large Language Models (LLMs). Current RLVR approaches typically train on all generated tokens, but neglect to examine which tokens (e.g., prefix tokens) actually contribute to reasoning. This uniform training strategy spends substantial effort on optimizing low-return tokens, which in turn impedes the potential improvement from high-return tokens and reduces overall training effectiveness. To address this issue, we propose a novel RLVR approach called Progressive Prefix-token Policy Optimization (PPPO), which highlights the significance of the prefix segment of generated outputs. Specifically, inspired by the well-established human thinking theory of Path Dependence, where early-stage thoughts substantially constrain the subsequent thinking trajectory, we identify an analogous phenomenon in LLM reasoning, termed the Beginning Lock-in Effect (BLE). PPPO leverages this finding by focusing its optimization objective on the prefix reasoning process of LLMs. This targeted optimization positively influences the subsequent reasoning process and ultimately improves final results. To help LLMs learn more effectively how to start reasoning with high quality, PPPO introduces two training strategies: (a) Progressive Prefix Retention, which shapes a progressive learning process by increasing the proportion of retained prefix tokens during training; and (b) Continuation Accumulated Reward, which mitigates reward bias by sampling multiple continuations for a given prefix token sequence and accumulating their scores as the reward signal. Extensive experimental results on various reasoning tasks demonstrate that our proposed PPPO outperforms representative RLVR methods, with accuracy improvements of 18.02% while using only 26.17% of the training tokens.
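As a rough illustration of the two strategies described above, the sketch below assumes a simple linear retention schedule and mean-based score accumulation; the function names (`prefix_retention_ratio`, `continuation_accumulated_reward`) and the `sample_continuation`/`verify` callables are hypothetical placeholders, not the paper's implementation.

```python
# A minimal sketch of the two PPPO training strategies named in the abstract.
# The linear schedule and mean-based accumulation are illustrative assumptions.

from typing import Callable, List, Sequence


def prefix_retention_ratio(step: int, total_steps: int,
                           start: float = 0.1, end: float = 1.0) -> float:
    """Progressive Prefix Retention: grow the fraction of retained prefix
    tokens as training proceeds (a linear schedule is assumed here)."""
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return start + (end - start) * frac


def retained_prefix(tokens: Sequence[int], ratio: float) -> Sequence[int]:
    """Keep only the leading `ratio` fraction of generated tokens for the
    policy-gradient update; later tokens are excluded from the loss."""
    n = max(1, int(len(tokens) * ratio))
    return tokens[:n]


def continuation_accumulated_reward(prefix: Sequence[int],
                                    sample_continuation: Callable[[Sequence[int]], Sequence[int]],
                                    verify: Callable[[Sequence[int]], float],
                                    num_continuations: int = 4) -> float:
    """Continuation Accumulated Reward: sample several continuations of the
    same prefix, score each with the verifiable reward, and accumulate
    (here: average) the scores as the prefix's reward signal."""
    scores: List[float] = []
    for _ in range(num_continuations):
        full_sequence = list(prefix) + list(sample_continuation(prefix))
        scores.append(verify(full_sequence))
    return sum(scores) / len(scores)
```

A trainer built on this sketch would compute the policy-gradient loss only over the tokens returned by `retained_prefix(...)`, with the accumulated continuation score serving as the reward term for that prefix.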