🤖 AI Summary
This work addresses diversity collapse in reinforcement learning with binary (verifiable) rewards, where policies improve single-sample accuracy at the expense of multi-sample coverage. The authors trace the phenomenon to a degeneracy of policy gradient methods under binary rewards: infinitely many distributions maximize expected reward, and none is distinguished. KL control resolves this degeneracy by selecting, in the limit β → 0, the filtered model (the base model conditioned on validity), which the tilted distribution approaches in forward KL. The paper derives explicit formulas relating the hyperparameter β to the target validity rate μ and shows that, under model misspecification, the pressure to decrease β drives the optimizer toward highly concentrated distributions over a few valid outputs rather than toward the filtered model. In contrast, alternative divergences that target the filtered model directly reward coverage of its support. The theoretical analysis, grounded in information geometry and variational inference, is illustrated with a toy autoregressive experiment.
📝 Abstract
Reinforcement learning with verifiable rewards (RLVR) has become a standard approach for improving reasoning in language models, yet models trained with RLVR often suffer from diversity collapse: while single-sample accuracy improves, multi-sample coverage degrades, sometimes falling below the base model. We provide a structural account of this phenomenon grounded in the properties of binary rewards. Binary rewards create a fundamental degeneracy for policy gradient methods: the set of distributions maximizing expected reward is infinite, with no distinguished element. KL-control resolves this degeneracy by selecting, in the limit $β\to 0$, the filtered model $p_*:=a(\cdot\mid\mathcal{Y}_1)$ -- the base model conditioned on validity -- which is the unique fully valid distribution closest to the base model in KL divergence. This selection operates through a nontrivial asymmetry: the tilted distribution $p_{[β]}\propto a(y)\,e^{v(y)/β}$ converges to $p_*$ in forward KL as $β\to 0$, yet $p_*$ cannot serve as a direct optimization target because $\mathrm{KL}(q\,\|\,p_*)$ is infinite for any full-support policy $q$. We develop explicit formulas relating the hyperparameter $β$ to the more interpretable target validity rate $μ$. Under model misspecification -- the typical practical regime -- the pressure to decrease $β$ drives the optimizer toward highly concentrated distributions over a small number of valid outputs, collapsing toward ever fewer as $β$ decreases, rather than toward the filtered model. We illustrate this mechanism on a toy autoregressive experiment and discuss how alternative divergences that target $p_*$ directly -- as pursued empirically by \citet{kruszewski_whatever_2026} -- avoid this failure mode by rewarding coverage of $p_*$'s support rather than concentration on high-validity outputs.
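To make the tilted distribution concrete, the following is a minimal sketch on a finite output set, assuming a binary validity function $v(y)\in\{0,1\}$ and writing $μ_0 := a(\mathcal{Y}_1)$ for the base model's validity rate (the symbol $μ_0$ and all function names are illustrative assumptions, not taken from the paper). Under these assumptions, the definition $p_{[β]}\propto a(y)\,e^{v(y)/β}$ gives the validity rate $μ(β) = μ_0 / (μ_0 + (1-μ_0)e^{-1/β})$, which inverts to $β = 1/(\operatorname{logit} μ - \operatorname{logit} μ_0)$; this may differ in form from the formulas the paper develops. The sketch covers only the well-specified case, in which the policy can represent $p_{[β]}$ exactly, so it does not reproduce the collapse that arises under misspecification.

```python
import math


def tilted(a, v, beta):
    """Tilted distribution p_beta(y) proportional to a(y) * exp(v(y) / beta).

    a: base-model probabilities over a finite output set (sums to 1)
    v: binary validity labels in {0, 1}, same length as a
    """
    # Subtract the largest exponent so small beta does not overflow.
    m = max(vi / beta for vi in v)
    w = [ai * math.exp(vi / beta - m) for ai, vi in zip(a, v)]
    z = sum(w)
    return [wi / z for wi in w]


def validity_rate(mu0, beta):
    """Validity rate mu(beta) of the tilted distribution, given base rate mu0."""
    return mu0 / (mu0 + (1.0 - mu0) * math.exp(-1.0 / beta))


def beta_for_target(mu0, mu):
    """Invert validity_rate: the beta that yields target rate mu.

    Requires 0 < mu0 < mu < 1 so that the result is finite and positive.
    """
    def logit(p):
        return math.log(p / (1.0 - p))

    return 1.0 / (logit(mu) - logit(mu0))


def forward_kl_to_filtered(a, v, beta):
    """KL(p_* || p_beta), where p_* is the base model conditioned on validity."""
    p = tilted(a, v, beta)
    mu0 = sum(ai for ai, vi in zip(a, v) if vi == 1)
    return sum(
        (ai / mu0) * math.log((ai / mu0) / pi)
        for ai, vi, pi in zip(a, v, p)
        if vi == 1
    )


# Hypothetical four-output example: the last two outputs are valid (mu0 = 0.2).
a = [0.5, 0.3, 0.15, 0.05]
v = [0, 0, 1, 1]
beta = beta_for_target(0.2, 0.9)
print(beta, tilted(a, v, beta), forward_kl_to_filtered(a, v, beta))
```

In this setting $p_{[β]}$ restricted to the valid set equals $μ(β)\,p_*$, so the forward divergence reduces to $\mathrm{KL}(p_*\,\|\,p_{[β]}) = -\log μ(β)$, which vanishes as $β\to 0$. The reverse divergence $\mathrm{KL}(p_{[β]}\,\|\,p_*)$ is infinite whenever $μ(β) < 1$, because $p_{[β]}$ then places mass on invalid outputs where $p_*$ is zero.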