AI Summary
In Reinforcement Learning with Verifiable Rewards (RLVR), finite-batch updates are prone to sampling bias and semantic coupling, leading to policy over-sharpening and collapse onto a few dominant modes, thereby compromising solution diversity. To address this, this work proposes an inverse-success advantage calibration mechanism that prioritizes difficult queries, coupled with a memory-network-driven distribution-level calibration approach that diversifies sampling. Through formal analysis and empirical evaluation, the proposed method effectively mitigates policy collapse and significantly improves generalization on logic-intensive tasks.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) is a central paradigm for turning large language models (LLMs) into reliable problem solvers, especially in logic-heavy domains. Despite its empirical success, it remains unclear whether RLVR elicits novel capabilities or merely sharpens the distribution over existing knowledge. We study this by formalizing over-sharpening, a phenomenon where the policy collapses onto limited modes, suppressing valid alternatives. At a high level, we discover that finite-batch updates intrinsically bias learning toward sampled modes, triggering a collapse that propagates globally via semantic coupling. To mitigate this, we propose inverse-success advantage calibration to prioritize difficult queries and distribution-level calibration to diversify sampling via a memory network. Empirical evaluations validate that our strategies effectively improve generalization.
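The inverse-success advantage calibration described above can be illustrated with a minimal sketch. This is not the paper's exact formulation: the function name, the group-relative baseline, and the `eps` smoothing term are illustrative assumptions. The core idea it demonstrates is up-weighting advantages on queries with low verified success rates, so difficult queries contribute more to each update.

```python
import numpy as np

def inverse_success_advantages(rewards, eps=0.1):
    """Sketch of inverse-success advantage calibration (illustrative, not the paper's exact method).

    rewards: (num_queries, num_rollouts) array of binary verifiable rewards
             (1 = rollout verified correct, 0 = incorrect).
    eps:     smoothing constant (assumed) to keep weights finite when a
             query has zero successes.
    """
    rewards = np.asarray(rewards, dtype=float)
    # Per-query empirical success rate over its sampled rollouts.
    success = rewards.mean(axis=1, keepdims=True)
    # Group-relative advantage: reward minus the query's own success rate.
    adv = rewards - success
    # Inverse-success weight: hard queries (low success) get larger updates.
    weight = 1.0 / (success + eps)
    return adv * weight
```

For example, a query solved in 1 of 4 rollouts receives a larger calibrated advantage on its successful rollout than a query solved in 3 of 4, steering gradient mass toward the modes the policy is currently failing on.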