Self-Distilled RLVR

📅 2026-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the information leakage and long-horizon training instability commonly observed in existing self-distillation methods that grant the teacher privileged information. To overcome these limitations, the authors propose RLSD (RLVR with Self-Distillation), a framework that integrates reinforcement learning with verifiable rewards (RLVR) and on-policy self-distillation (OPSD). This integration yields fine-grained, token-level policy update signals while preserving training stability: environmental feedback filters reliable update directions, and self-distillation supplies precise per-token update magnitudes, raising both model performance and the convergence ceiling. Empirical evaluations show that the proposed method achieves superior stability and final performance across multiple tasks compared to existing approaches.
📝 Abstract
On-policy distillation (OPD) has become a popular training paradigm in the LLM community. This paradigm selects a larger model as the teacher to provide dense, fine-grained signals for each sampled trajectory, in contrast to reinforcement learning with verifiable rewards (RLVR), which only obtains sparse signals from verifiable outcomes in the environment. Recently, the community has explored on-policy self-distillation (OPSD), where the same model serves as both teacher and student, with the teacher receiving additional privileged information such as reference answers to enable self-evolution. This paper demonstrates that learning signals solely derived from the privileged teacher result in severe information leakage and unstable long-term training. Accordingly, we identify the optimal niche for self-distillation and propose RLSD (RLVR with Self-Distillation). Specifically, we leverage self-distillation to obtain token-level policy differences for determining fine-grained update magnitudes, while continuing to use RLVR to derive reliable update directions from environmental feedback (e.g., response correctness). This enables RLSD to simultaneously harness the strengths of both RLVR and OPSD, achieving a higher convergence ceiling and superior training stability.
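The division of labor the abstract describes can be illustrated with a minimal sketch: the verifiable outcome reward fixes the sign (direction) of each token's update, as in RLVR, while the gap between the privileged teacher's and the student's token log-probabilities sets its size (magnitude), as in self-distillation. All function and variable names below are hypothetical; this is an illustration of the idea, not the paper's implementation.

```python
def rlsd_token_advantages(student_logps, teacher_logps, reward):
    """Toy per-token advantages combining RLVR and self-distillation.

    - `reward` (e.g. +1 correct, -1 incorrect) comes from the verifier
      and supplies the update *direction* for the whole response.
    - The per-token log-prob gap between the privileged teacher and the
      student supplies the *magnitude*: tokens where the teacher
      disagrees more get larger updates.
    """
    direction = 1.0 if reward > 0 else -1.0
    magnitudes = [abs(t - s) for s, t in zip(student_logps, teacher_logps)]
    return [direction * m for m in magnitudes]

# Toy example: a 4-token response the verifier judged correct.
adv = rlsd_token_advantages(
    student_logps=[-1.2, -0.3, -2.0, -0.8],
    teacher_logps=[-0.9, -0.3, -0.5, -1.1],
    reward=+1,
)
# Token 2, where teacher and student disagree most, gets the largest
# (positive) advantage; an incorrect response would flip all signs.
```

The point of the sketch is the factorization itself: pure OPSD would take both sign and size from the privileged teacher (risking leakage), whereas here the teacher only modulates magnitudes under an environment-verified direction.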
Problem

Research questions and friction points this paper is trying to address.

self-distillation
reinforcement learning
information leakage
training stability
on-policy distillation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Distillation
Reinforcement Learning with Verifiable Rewards
On-Policy Distillation
Token-level Policy Difference
Training Stability
Chenxu Yang
Institute of Information Engineering, Chinese Academy of Sciences
NLP; Dialogue Generation
Chuanyu Qin
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
Qingyi Si
JD.COM
Minghui Chen
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
Naibin Gu
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
Dingyu Yao
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
Zheng Lin
Institute of Information Engineering, CAS
NLP
Weiping Wang
School of Information Science and Engineering, Central South University
Computer Network; Network Security
Jiaqi Wang
Unknown affiliation
Nan Duan
JD.Com (now) | StepFun | Microsoft Research
NLP; Artificial General Intelligence