From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space

📅 2026-04-15

🤖 AI Summary
Reinforcement learning with verifiable rewards (RLVR) optimizes the conditional output distribution $P(y|x)$ and is therefore bounded by the base model's existing output distribution, while conventional pre-training learns passively from static corpora, which causes a distribution shift that hinders targeted reasoning enhancement. This work proposes PreRL (Pre-train Space RL), which applies reward-driven online updates directly to the marginal output distribution $P(y)$, and validates theoretically and empirically that the gradients of $\log P(y)$ and $\log P(y|x)$ are strongly aligned, making PreRL a viable surrogate for standard RL. It identifies Negative Sample Reinforcement (NSR) within PreRL as an especially effective driver of reasoning: NSR-PreRL rapidly prunes incorrect reasoning spaces and stimulates endogenous reflective behavior, increasing transition and reflection thoughts by 14.89× and 6.54×, respectively. Building on this, the paper presents Dual Space RL (DSRL), a Policy Reincarnation strategy that initializes the model with NSR-PreRL and then transitions to standard RL for fine-grained optimization. Experiments show that DSRL consistently outperforms strong baselines, indicating that pre-train space pruning steers the policy toward a refined correct-reasoning subspace.

📝 Abstract
While reinforcement learning with verifiable rewards (RLVR) significantly enhances LLM reasoning by optimizing the conditional distribution $P(y|x)$, its potential is fundamentally bounded by the base model's existing output distribution. Optimizing the marginal distribution $P(y)$ in the Pre-train Space addresses this bottleneck by encoding reasoning ability and preserving broad exploration capacity. Yet, conventional pre-training relies on static corpora for passive learning, leading to a distribution shift that hinders targeted reasoning enhancement. In this paper, we introduce PreRL (Pre-train Space RL), which applies reward-driven online updates directly to $P(y)$. We theoretically and empirically validate the strong gradient alignment between $\log P(y)$ and $\log P(y|x)$, establishing PreRL as a viable surrogate for standard RL. Furthermore, we uncover a critical mechanism: Negative Sample Reinforcement (NSR) within PreRL serves as an exceptionally effective driver for reasoning. NSR-PreRL rapidly prunes incorrect reasoning spaces while stimulating endogenous reflective behaviors, increasing transition and reflection thoughts by 14.89x and 6.54x, respectively. Leveraging these insights, we propose Dual Space RL (DSRL), a Policy Reincarnation strategy that initializes models with NSR-PreRL to expand the reasoning horizon before transitioning to standard RL for fine-grained optimization. Extensive experiments demonstrate that DSRL consistently outperforms strong baselines, proving that pre-train space pruning effectively steers the policy toward a refined correct reasoning subspace.
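
To make the idea of negative sample reinforcement on a marginal distribution concrete, the following is a minimal toy sketch and not the paper's implementation. It assumes a softmax distribution over four candidate responses with no prompt, standing in for $P(y)$, and a hypothetical verifier that marks exactly one response as correct. Each incorrect response contributes a reward of -1 times the gradient of its log-probability, and correct responses contribute nothing. The function names, the learning rate, and the use of an exact expected update instead of sampling are all illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nsr_expected_step(logits, is_correct, lr):
    """One exact (expected) negative-sample reinforcement step.

    Toy assumption: the policy is a softmax over a fixed set of responses.
    Only incorrect responses produce a learning signal (reward -1).
    """
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for y, p_y in enumerate(probs):
        if is_correct[y]:
            continue  # correct samples are ignored under NSR
        for k in range(len(logits)):
            # d log p(y) / d logit_k for a softmax policy
            dlogp = (1.0 if k == y else 0.0) - probs[k]
            # weight by the sampling probability and the reward of -1
            grad[k] += p_y * (-1.0) * dlogp
    # gradient ascent on the expected reward
    return [t + lr * g for t, g in zip(logits, grad)]


def run_nsr(num_steps=50, lr=1.0):
    """Start from a uniform distribution and apply repeated NSR steps."""
    is_correct = [False, False, True, False]  # hypothetical verifier labels
    logits = [0.0, 0.0, 0.0, 0.0]
    for _ in range(num_steps):
        logits = nsr_expected_step(logits, is_correct, lr)
    return softmax(logits)


if __name__ == "__main__":
    print(run_nsr())
```

In this toy setting, pushing probability away from incorrect responses moves mass onto the remaining one without ever rewarding it directly, which is the pruning intuition the abstract describes. A real language model would replace the four logits with token-level log-probabilities of a sampled response.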
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning
Pre-train Space
Reasoning Enhancement
Distribution Shift
Marginal Distribution
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pre-train Space RL
Negative Sample Reinforcement
Marginal Distribution Optimization
Dual Space RL
Policy Reincarnation