🤖 AI Summary
This study systematically investigates the applicability of parameter-efficient fine-tuning (PEFT) methods within the Reinforcement Learning with Verifiable Rewards (RLVR) paradigm, specifically for enhancing mathematical reasoning. Using the DeepSeek-R1-Distill model family and a unified RLVR training framework, it conducts the first comprehensive empirical evaluation of over 12 PEFT techniques. Key findings include: (i) DoRA, AdaLoRA, and MiSS substantially outperform standard LoRA; (ii) SVD-based initialization induces spectral collapse, a primary cause of performance degradation; and (iii) extremely low-rank configurations severely impair reasoning generalization. Through ablation studies, singular value spectrum analysis, and large-scale empirical validation, the work establishes a performance ranking of PEFT methods in RLVR and identifies critical design principles. It delivers the first empirically grounded benchmark and reproducible practical guidelines for parameter-efficient reinforcement learning.
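The singular value spectrum analysis mentioned above can be sketched with plain numpy: form the low-rank update ΔW = B·A, take its SVD, and count how many directions carry non-negligible energy. A "collapsed" spectrum would concentrate mass in far fewer directions than the nominal rank. The shapes, tolerance, and random initialization below are illustrative assumptions, not the paper's actual settings.

```python
# Illustrative sketch: diagnosing "spectral collapse" in a LoRA-style
# update Delta_W = B @ A via its singular value spectrum.
# Shapes, tolerance, and init are assumptions for demonstration only.
import numpy as np

def effective_rank(delta_w: np.ndarray, tol: float = 1e-3) -> int:
    """Count singular values above tol * (largest singular value)."""
    s = np.linalg.svd(delta_w, compute_uv=False)
    return int(np.sum(s > tol * s[0]))

rng = np.random.default_rng(0)
d, r = 64, 8                      # hidden size and adapter rank (illustrative)
B = rng.normal(size=(d, r))       # LoRA "up" projection
A = rng.normal(size=(r, d))       # LoRA "down" projection
delta_w = B @ A                   # rank-r update to the frozen weight

# A healthy rank-8 update exposes close to 8 usable directions; a
# collapsed spectrum would report far fewer at the same tolerance.
print(effective_rank(delta_w))
```

Tracking this count over training steps is one simple way to make the collapse phenomenon visible for a given initialization scheme.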
📄 Abstract
We systematically evaluate Parameter-Efficient Fine-Tuning (PEFT) methods under the paradigm of Reinforcement Learning with Verifiable Rewards (RLVR). RLVR incentivizes language models to enhance their reasoning capabilities through verifiable feedback; however, while methods like LoRA are commonly used, the optimal PEFT architecture for RLVR remains unidentified. In this work, we conduct the first comprehensive evaluation of over 12 PEFT methodologies across the DeepSeek-R1-Distill model family on mathematical reasoning benchmarks. Our empirical results challenge the default adoption of standard LoRA with three main findings. First, we demonstrate that structural variants, such as DoRA, AdaLoRA, and MiSS, consistently outperform LoRA. Second, we uncover a spectral collapse phenomenon in SVD-informed initialization strategies (*e.g.,* PiSSA, MiLoRA), attributing their failure to a fundamental misalignment between principal-component updates and RL optimization. Third, our ablations reveal that extreme parameter reduction (*e.g.,* VeRA, Rank-1) severely bottlenecks reasoning capacity. We further conduct ablation studies and scaling experiments to validate these findings. This work provides a practical guide to PEFT under RLVR and advocates further exploration of parameter-efficient RL methods.
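To make the "structural variant" distinction concrete, the DoRA reparameterization can be sketched in a few lines: the adapted weight is decomposed into a learnable per-column magnitude and a normalized direction, W′ = m · (W₀ + BA) / ‖W₀ + BA‖_c. This is a minimal numpy sketch of that recomposition, not the exact library implementation; shapes and initialization are illustrative assumptions.

```python
# Minimal sketch of DoRA's magnitude-direction decomposition,
# assuming column-wise normalization. Shapes and initialization are
# illustrative, not the reference implementation.
import numpy as np

def dora_weight(w0, A, B, m):
    """Recompose W' = m * V / ||V||_c, with V = W0 + B @ A and ||.||_c
    the per-column vector norm. m holds one learnable magnitude per
    column; initializing m = ||W0||_c makes training start at W0."""
    v = w0 + B @ A
    return m * (v / np.linalg.norm(v, axis=0, keepdims=True))

rng = np.random.default_rng(0)
d_out, d_in, r = 16, 16, 4
w0 = rng.normal(size=(d_out, d_in))           # frozen pretrained weight
A = rng.normal(size=(r, d_in))                # LoRA down-projection
B = np.zeros((d_out, r))                      # LoRA up-projection, zero-init
m = np.linalg.norm(w0, axis=0, keepdims=True) # magnitudes init to ||W0||_c

# With B = 0 and m = ||W0||_c, the recomposed weight equals W0 exactly.
print(np.allclose(dora_weight(w0, A, B, m), w0))
```

Separating magnitude from direction gives the optimizer an extra, cheap degree of freedom per column, which is one plausible reason such variants outperform vanilla LoRA in this setting.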