AI Summary
In Reinforcement Learning with Verifiable Rewards (RLVR), finite-batch updates are prone to sampling bias and semantic coupling, leading to policy over-sharpening and collapse onto a few dominant modes, thereby compromising solution diversity. To address this, this work proposes an inverse-success advantage calibration mechanism that prioritizes difficult queries, coupled with a memory-network-driven distribution-level calibration approach that diversifies sampling. Through formal analysis and empirical evaluation, the proposed method effectively mitigates policy collapse and significantly improves generalization on logic-intensive tasks.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) is a central paradigm for turning large language models (LLMs) into reliable problem solvers, especially in logic-heavy domains. Despite its empirical success, it remains unclear whether RLVR elicits novel capabilities or merely sharpens the distribution over existing knowledge. We study this by formalizing over-sharpening, a phenomenon where the policy collapses onto limited modes, suppressing valid alternatives. At a high level, we discover that finite-batch updates intrinsically bias learning toward sampled modes, triggering a collapse that propagates globally via semantic coupling. To mitigate this, we propose inverse-success advantage calibration to prioritize difficult queries and distribution-level calibration to diversify sampling via a memory network. Empirical evaluations validate that our strategies effectively improve generalization.
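The inverse-success advantage calibration described above can be illustrated with a minimal sketch. This is not the paper's exact formulation: the function name, the group-relative baseline, and the `eps` smoothing term are illustrative assumptions. The core idea it demonstrates is up-weighting advantages on queries with low verified success rates, so difficult queries contribute more to each update.

```python
import numpy as np

def inverse_success_advantages(rewards, eps=0.1):
    """Sketch of inverse-success advantage calibration (illustrative, not the paper's exact method).

    rewards: (num_queries, num_rollouts) array of binary verifiable rewards
             (1 = rollout verified correct, 0 = incorrect).
    eps:     smoothing constant (assumed) to keep weights finite when a
             query has zero successes.
    """
    rewards = np.asarray(rewards, dtype=float)
    # Per-query empirical success rate over its sampled rollouts.
    success = rewards.mean(axis=1, keepdims=True)
    # Group-relative advantage: reward minus the query's own success rate.
    adv = rewards - success
    # Inverse-success weight: hard queries (low success) get larger updates.
    weight = 1.0 / (success + eps)
    return adv * weight
```

For example, a query solved in 1 of 4 rollouts receives a larger calibrated advantage on its successful rollout than a query solved in 3 of 4, steering gradient mass toward the modes the policy is currently failing on.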