🤖 AI Summary
This study systematically investigates the applicability of parameter-efficient fine-tuning (PEFT) methods within the Reinforcement Learning with Verifiable Rewards (RLVR) paradigm, specifically for enhancing mathematical reasoning. Using the DeepSeek-R1-Distill model family and a unified RLVR training framework, it conducts the first comprehensive empirical evaluation of over 12 PEFT techniques. Key findings include: (i) DoRA, AdaLoRA, and MiSS substantially outperform standard LoRA; (ii) SVD-based initialization induces spectral collapse, a primary cause of performance degradation; and (iii) extremely low-rank configurations severely impair reasoning generalization. Through ablation studies, singular value spectrum analysis, and large-scale empirical validation, the work establishes a performance ranking of PEFT methods in RLVR and identifies critical design principles. It delivers the first empirically grounded benchmark and reproducible practical guidelines for parameter-efficient reinforcement learning.
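The singular value spectrum analysis mentioned above can be sketched with plain numpy: form the low-rank update ΔW = B·A, take its SVD, and count how many directions carry non-negligible energy. A "collapsed" spectrum would concentrate mass in far fewer directions than the nominal rank. The shapes, tolerance, and random initialization below are illustrative assumptions, not the paper's actual settings.

```python
# Illustrative sketch: diagnosing "spectral collapse" in a LoRA-style
# update Delta_W = B @ A via its singular value spectrum.
# Shapes, tolerance, and init are assumptions for demonstration only.
import numpy as np

def effective_rank(delta_w: np.ndarray, tol: float = 1e-3) -> int:
    """Count singular values above tol * (largest singular value)."""
    s = np.linalg.svd(delta_w, compute_uv=False)
    return int(np.sum(s > tol * s[0]))

rng = np.random.default_rng(0)
d, r = 64, 8                      # hidden size and adapter rank (illustrative)
B = rng.normal(size=(d, r))       # LoRA "up" projection
A = rng.normal(size=(r, d))       # LoRA "down" projection
delta_w = B @ A                   # rank-r update to the frozen weight

# A healthy rank-8 update exposes close to 8 usable directions; a
# collapsed spectrum would report far fewer at the same tolerance.
print(effective_rank(delta_w))
```

Tracking this count over training steps is one simple way to make the collapse phenomenon visible for a given initialization scheme.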
📄 Abstract
We systematically evaluate Parameter-Efficient Fine-Tuning (PEFT) methods under the paradigm of Reinforcement Learning with Verifiable Rewards (RLVR). RLVR incentivizes language models to enhance their reasoning capabilities through verifiable feedback; however, while methods like LoRA are commonly used, the optimal PEFT architecture for RLVR remains unidentified. In this work, we conduct the first comprehensive evaluation of over 12 PEFT methodologies across the DeepSeek-R1-Distill model family on mathematical reasoning benchmarks. Our empirical results challenge the default adoption of standard LoRA with three main findings. First, we demonstrate that structural variants, such as DoRA, AdaLoRA, and MiSS, consistently outperform LoRA. Second, we uncover a spectral collapse phenomenon in SVD-informed initialization strategies (*e.g.,* PiSSA, MiLoRA), attributing their failure to a fundamental misalignment between principal-component updates and RL optimization. Third, our ablations reveal that extreme parameter reduction (*e.g.,* VeRA, Rank-1) severely bottlenecks reasoning capacity. We further conduct ablation studies and scaling experiments to validate these findings. This work provides a practical guide to PEFT under RLVR and advocates further exploration of parameter-efficient RL methods.
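To make the "structural variant" distinction concrete, the DoRA reparameterization can be sketched in a few lines: the adapted weight is decomposed into a learnable per-column magnitude and a normalized direction, W′ = m · (W₀ + BA) / ‖W₀ + BA‖_c. This is a minimal numpy sketch of that recomposition, not the exact library implementation; shapes and initialization are illustrative assumptions.

```python
# Minimal sketch of DoRA's magnitude-direction decomposition,
# assuming column-wise normalization. Shapes and initialization are
# illustrative, not the reference implementation.
import numpy as np

def dora_weight(w0, A, B, m):
    """Recompose W' = m * V / ||V||_c, with V = W0 + B @ A and ||.||_c
    the per-column vector norm. m holds one learnable magnitude per
    column; initializing m = ||W0||_c makes training start at W0."""
    v = w0 + B @ A
    return m * (v / np.linalg.norm(v, axis=0, keepdims=True))

rng = np.random.default_rng(0)
d_out, d_in, r = 16, 16, 4
w0 = rng.normal(size=(d_out, d_in))           # frozen pretrained weight
A = rng.normal(size=(r, d_in))                # LoRA down-projection
B = np.zeros((d_out, r))                      # LoRA up-projection, zero-init
m = np.linalg.norm(w0, axis=0, keepdims=True) # magnitudes init to ||W0||_c

# With B = 0 and m = ||W0||_c, the recomposed weight equals W0 exactly.
print(np.allclose(dora_weight(w0, A, B, m), w0))
```

Separating magnitude from direction gives the optimizer an extra, cheap degree of freedom per column, which is one plausible reason such variants outperform vanilla LoRA in this setting.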