When Sharpening Becomes Collapse: Sampling Bias and Semantic Coupling in RL with Verifiable Rewards

📅 2026-01-22
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
In Reinforcement Learning with Verifiable Rewards (RLVR), finite-batch updates are prone to sampling bias and semantic coupling, driving the policy to over-sharpen and collapse onto a few dominant modes, which compromises solution diversity. To address this, the work proposes an inverse-success advantage calibration mechanism that prioritizes difficult queries, coupled with a memory network–driven, distribution-level calibration that diversifies sampling. Through formal analysis and empirical evaluation, the authors show that the proposed method effectively mitigates policy collapse and improves generalization on logic-intensive tasks.

๐Ÿ“ Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) is a central paradigm for turning large language models (LLMs) into reliable problem solvers, especially in logic-heavy domains. Despite its empirical success, it remains unclear whether RLVR elicits novel capabilities or merely sharpens the distribution over existing knowledge. We study this by formalizing over-sharpening, a phenomenon where the policy collapses onto limited modes, suppressing valid alternatives. At a high level, we discover finite-batch updates intrinsically bias learning toward sampled modes, triggering a collapse that propagates globally via semantic coupling. To mitigate this, we propose inverse-success advantage calibration to prioritize difficult queries and distribution-level calibration to diversify sampling via a memory network. Empirical evaluations validate that our strategies can effectively improve generalization.
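The inverse-success advantage calibration described in the abstract can be sketched in a few lines. This is a hypothetical minimal illustration, not the paper's implementation: the function name, the group-relative (GRPO-style) baseline, and the exact 1/(p + ε) weighting form are all assumptions made here for concreteness.

```python
def calibrated_advantages(rewards_per_query, eps=1e-6):
    """Inverse-success advantage calibration (illustrative sketch).

    rewards_per_query: list of lists of 0/1 verifier rewards, one inner
    list per query (one entry per sampled response in the group).
    Returns per-response advantages, reweighted so that queries with a
    low empirical success rate contribute larger gradients.
    """
    out = []
    for rewards in rewards_per_query:
        n = len(rewards)
        success = sum(rewards) / n       # empirical success rate p_q
        weight = 1.0 / (success + eps)   # assumed inverse-success weight
        # group-relative advantage (reward minus group mean), rescaled
        # per query; for binary rewards the group mean equals p_q
        out.append([weight * (r - success) for r in rewards])
    return out
```

A side effect of this particular weighting: a failed response always receives an advantage near −1 (since p/(p + ε) ≈ 1), while a successful response on a hard query (low p) is strongly up-weighted, matching the stated goal of prioritizing difficult queries.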
Problem

Research questions and friction points this paper is trying to address.

sampling bias
semantic coupling
over-sharpening
policy collapse
reinforcement learning with verifiable rewards
Innovation

Methods, ideas, or system contributions that make the work stand out.

over-sharpening
sampling bias
semantic coupling
advantage calibration
memory network