Clipping Bottleneck: Stabilizing RLVR via Stochastic Recovery of Near-Boundary Signals

๐Ÿ“… 2026-05-21
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF

career value

237K/year
๐Ÿค– AI Summary
This work identifies and characterizes a previously overlooked โ€œclipping bottleneckโ€ in RLVR training, wherein hard clipping discards informative signals near the clipping boundary, leading to unstable optimization and suboptimal convergence. To address this issue, the authors propose a lightweight, plug-and-play stochastic rescue mechanism that randomly retains slightly out-of-bound tokens in the vicinity of the clipping threshold, thereby implicitly modulating gradient expectations. Built upon a GRPO-style objective, the method requires no architectural modifications and consistently enhances training stability across both dense and Mixture-of-Experts (MoE) models ranging from 7B to 30B parameters. Empirical results demonstrate clear and consistent improvements over strong baselines such as DAPO and GSPO.
๐Ÿ“ Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a central paradigm for scaling LLM reasoning, yet its optimization often suffers from training instability and suboptimal convergence. Through a systematic dissection of clipping-based GRPO-style objectives, we identify the rigid clipping decision induced by hard clipping as a key practical bottleneck in the studied RLVR setups. Specifically, our analysis suggests that informative signals can lie in the near-boundary region just beyond the clipping threshold, and are therefore discarded by the standard hard-clipping rule. Notably, once this bottleneck is precisely identified, even simple stochastic perturbations at the boundary can recover meaningful performance gains. Building on this finding, we propose Near-boundary Stochastic Rescue (NSR), a minimal, plug-and-play modification that stochastically retains these slightly out-of-bound tokens to recover lost signals. While NSR, via stochastic sampling, can be interpreted as inducing an implicit gradient decay in expectation, our ablations reveal that its stochastic, boundary-local rescue mechanism is consistently more effective than deterministic gradient decay. Validated by extensive experiments across model sizes from 7B to 30B and both dense and MoE architectures, as a plug-and-play solution, NSR substantially improves training stability and delivers consistent gains over strong baselines such as DAPO and GSPO.
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning with Verifiable Rewards
training instability
clipping bottleneck
near-boundary signals
suboptimal convergence
Innovation

Methods, ideas, or system contributions that make the work stand out.

clipping bottleneck
stochastic rescue
RLVR
near-boundary signals
training stability
๐Ÿ”Ž Similar Papers
No similar papers found.