AI Summary
This work uncovers an implicit reward overfitting phenomenon in Reinforcement Learning with Verifiable Rewards (RLVR) and builds on the observation that the reasoning gains from RLVR are concentrated primarily in the rank-1 components of the model's weight matrices. Using periodic rank-1 substitution, singular value spectrum analysis, and singular vector alignment measurements, the study examines how RLVR improves sampling efficiency by optimizing specific low-rank structures. It reports that test performance can be decoupled from training rewards, and characterizes three properties of RLVR training: the effective rank-1 component carries mathematical reasoning ability but not other model knowledge, singular values in nearly all linear layers follow a heavy-tailed distribution, and the left singular vectors of the rank-1 components become increasingly aligned during training. These findings clarify how RLVR shapes model parameters and suggest directions for improving RL and other training paradigms for continual learning.
Abstract
Recent research has shown that the enhanced reasoning capabilities models acquire through Reinforcement Learning with Verifiable Rewards (RLVR) are concentrated primarily in the rank-1 components. Building on this observation, we employ Periodic Rank-1 Substitution and identify a counterintuitive phenomenon: RLVR may exhibit implicit reward overfitting to the training dataset. Specifically, the model can achieve satisfactory performance on the test set even when its rewards remain relatively low during training. Furthermore, we characterize three distinct properties of RL training: (1) The effective rank-1 component in RLVR does not retain model knowledge other than mathematical reasoning capability. (2) RLVR fundamentally functions by optimizing a specific singular spectrum: the singular values of almost all linear layers in an RLVR-trained model follow a heavy-tailed distribution. (3) The left singular vectors associated with the rank-1 components become more strongly aligned during training, which echoes the finding that RLVR in essence optimizes sampling efficiency. Taken together, our findings and analysis further reveal how RLVR shapes model parameters and offer potential insights for improving existing RL paradigms or other training paradigms to implement continual learning.
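The abstract does not spell out how the rank-1 component is obtained or how alignment is measured, so the following is only a minimal NumPy sketch under stated assumptions: the rank-1 component is taken to be the top singular triplet of the weight update (trained weights minus base weights), "substitution" is taken to mean replacing the full update with that component, and alignment is taken to be the absolute cosine similarity between left singular vectors. The function names, matrix sizes, and the synthetic update with a planted direction are all hypothetical and serve illustration only.

```python
import numpy as np


def rank1_component(delta_w):
    """Best rank-1 approximation of a matrix (top singular triplet of its SVD)."""
    u, s, vt = np.linalg.svd(delta_w, full_matrices=False)
    return s[0] * np.outer(u[:, 0], vt[0]), u[:, 0], s


def left_alignment(u_a, u_b):
    """Absolute cosine similarity between two unit-norm left singular vectors."""
    return float(abs(u_a @ u_b))


rng = np.random.default_rng(0)

# Hypothetical base weights for one linear layer.
w_base = rng.normal(size=(64, 32))

# Synthetic stand-in for an RL update: one planted rank-1 direction plus small noise.
a = rng.normal(size=64)
b = rng.normal(size=32)
delta = np.outer(a, b) + 0.01 * rng.normal(size=(64, 32))
w_rl = w_base + delta

# Extract the rank-1 component of the update and substitute it for the full update.
r1, u1, s = rank1_component(w_rl - w_base)
w_sub = w_base + r1

# Share of the update's squared singular values carried by the top one.
energy = s[0] ** 2 / np.sum(s ** 2)

# Alignment of the recovered left singular vector with the planted direction.
align = left_alignment(u1, a / np.linalg.norm(a))

print(f"top singular value energy share: {energy:.4f}")
print(f"alignment with planted direction: {align:.4f}")
```

In an actual training run, a periodic variant of this would presumably apply the substitution every fixed number of optimizer steps and across all linear layers; that schedule is an assumption here, not something the abstract specifies.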