KDRL: Post-Training Reasoning LLMs via Unified Knowledge Distillation and Reinforcement Learning

📅 2025-06-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the trade-off between low sample efficiency in reinforcement learning (RL) and weak out-of-distribution generalization in knowledge distillation (KD) during post-training of large language models, this paper proposes the first unified RL-KD optimization framework. Methodologically, it innovatively integrates GRPO-based policy optimization with reverse KL divergence constraints, designing a reward-guided dynamic distillation mechanism and achieving balanced exploration and supervision via multi-stage KL coefficient scheduling. Our contributions are threefold: (1) a differentiable, end-to-end joint RL-KD objective; (2) significant improvements in reasoning accuracy and token efficiency; and (3) consistent superiority over GRPO and diverse KD baselines across multiple benchmarks, demonstrating both enhanced generalization robustness and training efficiency.
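The multi-stage KL coefficient scheduling mentioned above can be sketched as a simple step schedule that decays the distillation strength as training progresses. The function name, stage boundaries, and coefficient values below are illustrative assumptions, not values from the paper:

```python
def kl_coefficient(step, stages=((0, 1e-3), (2000, 1e-4), (6000, 0.0))):
    """Hypothetical multi-stage KL-coefficient schedule.

    Each (start_step, beta) pair activates once training reaches start_step,
    so supervision from the teacher is strong early and fades toward pure RL.
    The specific stage boundaries and betas here are assumptions.
    """
    beta = stages[0][1]
    for start, value in stages:
        if step >= start:
            beta = value
    return beta
```

A later-stage call such as `kl_coefficient(3000)` falls into the second stage and returns the smaller coefficient, reflecting the shift from teacher supervision toward self-exploration.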

📝 Abstract
Recent advances in large language model (LLM) post-training have leveraged two distinct paradigms to enhance reasoning capabilities: reinforcement learning (RL) and knowledge distillation (KD). While RL enables the emergence of complex reasoning behaviors, it often suffers from low sample efficiency when the initial policy struggles to explore high-reward trajectories. Conversely, KD improves learning efficiency via mimicking the teacher model but tends to generalize poorly to out-of-domain scenarios. In this work, we present KDRL, a unified post-training framework that jointly optimizes a reasoning model through teacher supervision (KD) and self-exploration (RL). Specifically, KDRL leverages policy gradient optimization to simultaneously minimize the reverse Kullback-Leibler divergence (RKL) between the student and teacher distributions while maximizing the expected rule-based rewards. We first formulate a unified objective that integrates GRPO and KD, and systematically explore how different KL approximations, KL coefficients, and reward-guided KD strategies affect the overall post-training dynamics and performance. Empirical results on multiple reasoning benchmarks demonstrate that KDRL outperforms GRPO and various KD baselines while achieving a favorable balance between performance and reasoning token efficiency. These findings indicate that integrating KD and RL serves as an effective and efficient strategy to train reasoning LLMs.
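The unified objective described in the abstract combines a policy-gradient reward term with a reverse-KL penalty toward the teacher. A rough per-sequence sketch follows; the function name, the per-token KL estimate, and all numeric values are illustrative assumptions rather than the paper's exact formulation:

```python
def kdrl_loss(student_logps, teacher_logps, advantage, beta=0.1):
    """Hypothetical sketch of a KDRL-style per-sequence loss.

    student_logps / teacher_logps: per-token log-probabilities of the sampled
    sequence under the student and teacher models (assumed aligned).
    advantage: a GRPO-style, rule-based group-normalized advantage.
    beta: KL coefficient balancing exploration against teacher supervision.
    """
    # Policy-gradient term: maximize the advantage-weighted log-likelihood.
    pg_term = -advantage * sum(student_logps)
    # Reverse KL estimate under the student's samples:
    # E_student[log p_student - log p_teacher], summed per token.
    rkl = sum(s - t for s, t in zip(student_logps, teacher_logps))
    return pg_term + beta * rkl
```

With zero advantage the loss reduces to the distillation term alone, which mirrors the intuition that teacher supervision dominates when rule-based rewards give no learning signal.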
Problem

Research questions and friction points this paper is trying to address.

Enhance LLM reasoning via unified KD and RL
Improve sample efficiency in RL exploration
Improve out-of-domain generalization of KD
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified KD and RL for LLM post-training
Policy gradient optimizes RKL and rewards
Balances performance and token efficiency