🤖 AI Summary
Existing RLVR methods optimize all generated tokens uniformly, overlooking the critical role of prefix tokens in reasoning; this wastes computation on low-return tokens and constrains the improvement attainable from high-return ones. This paper proposes Progressive Prefix-token Policy Optimization (PPPO), a framework built on an identified "Beginning Lock-in Effect," in which the early portion of a reasoning trace substantially constrains its continuation. PPPO introduces two training strategies, Progressive Prefix Retention and Continuation Accumulated Reward, that focus policy optimization on how the model begins its reasoning, combining path-dependence-inspired modeling, Monte Carlo continuation sampling, and policy gradients. Evaluated across diverse reasoning tasks, it substantially outperforms representative RLVR approaches, achieving an 18.02% accuracy improvement while requiring only 26.17% of the training tokens used by baseline methods, demonstrating gains in both training efficiency and reasoning quality.
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) significantly enhances the reasoning capability of Large Language Models (LLMs). Current RLVR approaches typically train on all generated tokens, but neglect to examine which tokens (e.g., prefix tokens) actually contribute to reasoning. This uniform training strategy spends substantial effort on optimizing low-return tokens, which in turn impedes the potential improvement from high-return tokens and reduces overall training effectiveness. To address this issue, we propose a novel RLVR approach called Progressive Prefix-token Policy Optimization (PPPO), which highlights the significance of the prefix segment of generated outputs. Specifically, inspired by the well-established human thinking theory of Path Dependence, where early-stage thoughts substantially constrain the subsequent thinking trajectory, we identify an analogous phenomenon in LLM reasoning, termed the Beginning Lock-in Effect (BLE). PPPO leverages this finding by focusing its optimization objective on the prefix reasoning process of LLMs. This targeted optimization positively influences the subsequent reasoning process and ultimately improves final results. To help LLMs learn more effectively how to start reasoning with high quality, PPPO introduces two training strategies: (a) Progressive Prefix Retention, which shapes a progressive learning process by increasing the proportion of retained prefix tokens during training; and (b) Continuation Accumulated Reward, which mitigates reward bias by sampling multiple continuations for a given prefix token sequence and accumulating their scores as the reward signal. Extensive experimental results on various reasoning tasks demonstrate that our proposed PPPO outperforms representative RLVR methods, with accuracy improvements of 18.02% while using only 26.17% of the training tokens.
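As a rough illustration of the two strategies described above, the sketch below assumes a simple linear retention schedule and mean-based score accumulation; the function names (`prefix_retention_ratio`, `continuation_accumulated_reward`) and the `sample_continuation`/`verify` callables are hypothetical placeholders, not the paper's implementation.

```python
# A minimal sketch of the two PPPO training strategies named in the abstract.
# The linear schedule and mean-based accumulation are illustrative assumptions.

from typing import Callable, List, Sequence


def prefix_retention_ratio(step: int, total_steps: int,
                           start: float = 0.1, end: float = 1.0) -> float:
    """Progressive Prefix Retention: grow the fraction of retained prefix
    tokens as training proceeds (a linear schedule is assumed here)."""
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return start + (end - start) * frac


def retained_prefix(tokens: Sequence[int], ratio: float) -> Sequence[int]:
    """Keep only the leading `ratio` fraction of generated tokens for the
    policy-gradient update; later tokens are excluded from the loss."""
    n = max(1, int(len(tokens) * ratio))
    return tokens[:n]


def continuation_accumulated_reward(prefix: Sequence[int],
                                    sample_continuation: Callable[[Sequence[int]], Sequence[int]],
                                    verify: Callable[[Sequence[int]], float],
                                    num_continuations: int = 4) -> float:
    """Continuation Accumulated Reward: sample several continuations of the
    same prefix, score each with the verifiable reward, and accumulate
    (here: average) the scores as the prefix's reward signal."""
    scores: List[float] = []
    for _ in range(num_continuations):
        full_sequence = list(prefix) + list(sample_continuation(prefix))
        scores.append(verify(full_sequence))
    return sum(scores) / len(scores)
```

A trainer built on this sketch would compute the policy-gradient loss only over the tokens returned by `retained_prefix(...)`, with the accumulated continuation score serving as the reward term for that prefix.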