Reinforcement-aware Knowledge Distillation for LLM Reasoning

📅 2026-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of distribution mismatch and objective conflict that arise when integrating knowledge distillation with reinforcement learning, which hinder the effective transfer of reasoning capabilities from large language models. To overcome these issues, the authors propose a reinforcement learning–aware distillation framework featuring a trust-region ratio distillation mechanism. This approach selectively imitates the teacher policy only when such guidance is beneficial to the current policy update. By anchoring the distillation process to a mixture distribution of the teacher and the old policy, the method enables advantage-aware, trust-region–constrained imitation that naturally balances exploration, exploitation, and mimicry. Experimental results demonstrate that the proposed approach significantly outperforms offline distillation, standard GRPO, and KL-divergence–based online teacher-student distillation across multiple logical reasoning and mathematical reasoning benchmarks.

📝 Abstract
Reinforcement learning (RL) post-training has recently driven major gains in long chain-of-thought reasoning large language models (LLMs), but the high inference cost of such models motivates distillation into smaller students. Most existing knowledge distillation (KD) methods are designed for supervised fine-tuning (SFT), relying on fixed teacher traces or teacher-student Kullback-Leibler (KL) divergence-based regularization. When combined with RL, these approaches often suffer from distribution mismatch and objective interference: teacher supervision may not align with the student's evolving rollout distribution, and the KL regularizer can compete with reward maximization and require careful loss balancing. To address these issues, we propose RL-aware distillation (RLAD), which performs selective imitation during RL -- guiding the student toward the teacher only when it improves the current policy update. Our core component, Trust Region Ratio Distillation (TRRD), replaces the teacher-student KL regularizer with a PPO/GRPO-style likelihood-ratio objective anchored to a teacher–old-policy mixture, yielding advantage-aware, trust-region-bounded distillation on student rollouts and naturally balancing exploration, exploitation, and imitation. Across diverse logic reasoning and math benchmarks, RLAD consistently outperforms offline distillation, standard GRPO, and KL-based on-policy teacher-student knowledge distillation.
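The abstract's core idea — a PPO/GRPO-style likelihood ratio anchored to a teacher–old-policy mixture instead of a teacher-student KL penalty — can be sketched as follows. This is a minimal illustrative reconstruction, not the authors' implementation: the mixture weight `alpha`, the probability-space mixing, and the clipping details are assumptions based on the standard PPO clipped surrogate.

```python
import numpy as np

def trrd_loss(logp_student, logp_teacher, logp_old, advantages,
              alpha=0.5, clip_eps=0.2):
    """Illustrative sketch of a Trust Region Ratio Distillation objective.

    Assumed shape: the likelihood ratio is taken against a mixture anchor
    p_mix = alpha * p_teacher + (1 - alpha) * p_old, then fed through a
    PPO-style clipped surrogate so that movement toward the teacher is
    bounded (trust region) and weighted by the advantage (advantage-aware).
    All arrays are per-sampled-token log-probabilities / advantages.
    """
    # Mixture anchor in probability space (assumption: linear mixing).
    p_mix = alpha * np.exp(logp_teacher) + (1.0 - alpha) * np.exp(logp_old)
    ratio = np.exp(logp_student) / p_mix

    # PPO-style clipped surrogate: imitation of the anchor only helps the
    # objective when the advantage supports it, and large ratio deviations
    # are clipped, bounding each policy update.
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))
```

With `alpha = 0` this reduces to the ordinary GRPO/PPO ratio against the old policy; with `alpha > 0` the anchor is pulled toward the teacher, so the student is rewarded for matching teacher-preferred tokens only on rollouts where the advantage is positive — the "selective imitation" the abstract describes.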
Problem

Research questions and friction points this paper is trying to address.

knowledge distillation
reinforcement learning
distribution mismatch
objective interference
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Learning
Knowledge Distillation
Trust Region
Likelihood Ratio
Large Language Models