AI Summary
Existing reinforcement learning with verifiable rewards (RLVR) approaches struggle to improve pass@k while enhancing pass@1, thereby limiting large language models' capacity to explore diverse reasoning paths. This work proposes the SAGE framework, which reshapes the anchor distribution in reverse KL regularization and introduces a guidance function \( q(x,y) \) to simultaneously maintain training stability and enhance policy exploration. SAGE achieves the first controllable expansion of the empirical support set, promoting the generation of diverse reasoning trajectories without compromising sampling efficiency. Experimental results demonstrate that the method significantly and concurrently improves both pass@1 and pass@k across multiple mathematical reasoning benchmarks, substantiating a genuine enhancement in reasoning capability.
Abstract
Recent studies observe that reinforcement learning with verifiable rewards (RLVR) reliably improves pass@1 on reasoning tasks, yet often fails to yield comparable gains in pass@k, raising the question of whether RLVR genuinely enables large language models to acquire novel reasoning abilities or merely enhances the efficiency of sampling reasoning modes already present in the base model. Prior analyses largely support the latter view, attributing this limitation to structural properties of standard RLVR objectives that result in insufficient exploration pressure. In this work, we argue that a central structural constraint arises from reverse-KL regularization, which stabilizes training but inherently anchors the policy to the reference distribution, thereby suppressing the emergence of alternative reasoning modes. However, we show that neither removing the KL term nor replacing it with forward-KL provides a satisfactory solution, as both disrupt the efficiency-coverage trade-off by either inducing reward hacking or allocating probability mass to off-target regions. To resolve this tension, we propose SAGE, a principled framework that enables controllable empirical support expansion by reshaping the reverse-KL anchor distribution itself through a guide function q(x,y), achieving consistent improvements in both pass@1 and pass@k across challenging mathematical reasoning benchmarks. Our code is available at https://github.com/tally0818/SAGE.
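To make the pass@1 versus pass@k distinction concrete, the sketch below computes pass@k with the widely used unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. The abstract does not say which estimator the authors use, so this choice is an assumption, and the function name `pass_at_k`, the helper `mean_pass_at_k`, and the toy counts are purely illustrative rather than taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples drawn, c: correct samples, k: evaluation budget (k <= n).
    Equals the probability that a random size-k subset of the n samples
    contains at least one correct sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over problems, given per-problem correct counts."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical toy counts (not from the paper): 16 samples on two problems.
# "sharp" concentrates on one problem; "broad" solves both, but rarely.
sharp = [8, 0]
broad = [2, 2]

for name, counts in (("sharp", sharp), ("broad", broad)):
    p1 = mean_pass_at_k(counts, n=16, k=1)
    p8 = mean_pass_at_k(counts, n=16, k=8)
    print(f"{name}: pass@1={p1:.3f} pass@8={p8:.3f}")
```

In this toy setting the "sharp" policy has the higher pass@1 while the "broad" policy has the higher pass@8, which is the kind of gap between sampling efficiency and coverage that the abstract describes.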