Low-probability Tokens Sustain Exploration in Reinforcement Learning with Verifiable Reward

📅 2025-10-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
In Reinforcement Learning with Verifiable Rewards (RLVR), policy entropy collapse degrades exploration, causing critical low-probability exploratory tokens—termed “reasoning sparks”—to be systematically suppressed due to excessive penalization. This paper is the first to formally define and empirically characterize this phenomenon. We propose Low-Probability Regularization (Lp-Reg), a novel online regularization method that constructs a denoised proxy distribution as a soft target, and integrates KL-divergence-based policy regularization with token-level probability reweighting and normalization to preserve sparse but beneficial exploratory actions. Unlike entropy-maximizing approaches, Lp-Reg selectively safeguards meaningful exploration without compromising training stability. Evaluated on five mathematical reasoning benchmarks, our method achieves a mean accuracy of 60.17%, outperforming baselines by +2.66 percentage points, and enables stable training beyond 1,000 steps—significantly mitigating exploration stagnation in RLVR.

📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has propelled Large Language Models in complex reasoning, yet its scalability is often hindered by a training bottleneck where performance plateaus as policy entropy collapses, signaling a loss of exploration. Previous methods typically address this by maintaining high policy entropy, yet the precise mechanisms that govern meaningful exploration have remained underexplored. Our analysis suggests that an unselective focus on entropy risks amplifying irrelevant tokens and destabilizing training. This paper investigates the exploration dynamics within RLVR and identifies a key issue: the gradual elimination of valuable low-probability exploratory tokens, which we term "reasoning sparks". We find that while abundant in pre-trained models, these sparks are systematically extinguished during RLVR due to over-penalization, leading to a degeneracy in exploration. To address this, we introduce Low-probability Regularization (Lp-Reg). Its core mechanism regularizes the policy towards a heuristic proxy distribution. This proxy is constructed by filtering out presumed noise tokens and re-normalizing the distribution over the remaining candidates. The result is a less-noisy proxy where the probability of "reasoning sparks" is amplified, which then serves as a soft regularization target to shield these valuable tokens from elimination via KL divergence. Experiments show that Lp-Reg enables stable on-policy training for around 1,000 steps, a regime where baseline entropy-control methods collapse. This sustained exploration leads to state-of-the-art performance, achieving a 60.17% average accuracy on five math benchmarks, an improvement of 2.66% over prior methods. Code is available at https://github.com/CarlanLark/Lp-Reg.
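The proxy-construction step described above can be sketched in a few lines: drop tokens below a noise threshold, renormalize over the survivors (which amplifies the remaining low-probability "reasoning sparks"), and penalize divergence from that proxy. This is a minimal illustration, not the paper's implementation; the threshold value, the KL direction, and the function names here are assumptions.

```python
import numpy as np

def denoised_proxy(probs, noise_threshold=0.05):
    """Heuristic proxy distribution (illustrative): zero out tokens whose
    probability falls below a threshold (presumed noise), then renormalize
    over the surviving candidates, amplifying surviving low-prob tokens."""
    probs = np.asarray(probs, dtype=np.float64)
    kept = np.where(probs >= noise_threshold, probs, 0.0)
    if kept.sum() == 0.0:  # degenerate case: nothing survives the filter
        return probs
    return kept / kept.sum()

def kl_regularizer(policy_probs, proxy_probs, eps=1e-12):
    """KL(proxy || policy) penalty (direction chosen for illustration);
    minimizing it pulls the policy's mass back toward the proxy, shielding
    the amplified low-probability tokens from elimination."""
    p = np.asarray(proxy_probs, dtype=np.float64) + eps
    q = np.asarray(policy_probs, dtype=np.float64) + eps
    return float(np.sum(p * np.log(p / q)))
```

For example, with token probabilities `[0.6, 0.3, 0.06, 0.04]` and a 0.05 threshold, the last token is treated as noise and the 0.06 token's proxy probability rises above 0.06 after renormalization.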
Problem

Research questions and friction points this paper is trying to address.

Addresses exploration collapse in reinforcement learning with verifiable rewards
Identifies over-penalization of low-probability reasoning tokens during training
Proposes regularization to sustain exploratory tokens and prevent training instability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Regularizes policy towards heuristic proxy distribution
Amplifies low-probability tokens via filtered distribution
Uses KL divergence to sustain exploratory tokens