Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

📅 2026-01-26
📈 Citations: 7
Influential: 0
🤖 AI Summary
This work addresses the distribution mismatch between training and inference in traditional knowledge distillation for language models. To overcome limitations of existing on-policy approaches—which rely on external large teacher models and do not exploit the ground-truth reasoning trajectories available in datasets—the authors propose On-Policy Self-Distillation (OPSD), the first framework enabling self-distillation within a single model. In OPSD, the same language model acts as both teacher and student: the teacher conditions on privileged information (e.g., ground-truth reasoning paths) while the student does not, and training minimizes, over the student's own generated trajectories, the token-wise KL divergence between the two distributions. By explicitly leveraging real reasoning traces from the dataset as privileged signals and integrating on-policy rollouts with KL-based optimization, OPSD significantly outperforms off-policy distillation methods and achieves 4–8× higher token efficiency than reinforcement learning baselines such as GRPO, while delivering superior performance on multiple mathematical reasoning benchmarks.

📝 Abstract
Knowledge distillation improves large language model (LLM) reasoning by compressing the knowledge of a teacher LLM to train smaller LLMs. On-policy distillation advances this approach by having the student sample its own trajectories while a teacher LLM provides dense token-level supervision, addressing the distribution mismatch between training and inference in off-policy distillation methods. However, on-policy distillation typically requires a separate, often larger, teacher LLM and does not explicitly leverage ground-truth solutions available in reasoning datasets. Inspired by the intuition that a sufficiently capable LLM can rationalize external privileged reasoning traces and teach its weaker self (i.e., the version without access to privileged information), we introduce On-Policy Self-Distillation (OPSD), a framework where a single model acts as both teacher and student by conditioning on different contexts. The teacher policy conditions on privileged information (e.g., verified reasoning traces) while the student policy sees only the question; training minimizes the per-token divergence between these distributions over the student's own rollouts. We demonstrate the efficacy of our method on multiple mathematical reasoning benchmarks, achieving 4-8x token efficiency compared to reinforcement learning methods such as GRPO and superior performance over off-policy distillation methods.
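The core training signal described in the abstract — a per-token divergence between the same model's teacher-conditioned and student-conditioned distributions, computed over the student's own rollouts — can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the function name `opsd_loss`, the logit shapes, and the choice of forward KL are assumptions; the paper may use a different divergence direction or estimator.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def opsd_loss(student_logits, teacher_logits, eps=1e-12):
    """Per-token KL(student || teacher), averaged over a rollout.

    student_logits: [T, V] logits from the model given only the question.
    teacher_logits: [T, V] logits from the *same* model given the question
                    plus privileged information (e.g., a verified reasoning
                    trace), scored at the same student-sampled positions.
    """
    p = softmax(student_logits)   # student distribution at each token
    q = softmax(teacher_logits)   # teacher distribution at each token
    kl = (p * (np.log(p + eps) - np.log(q + eps))).sum(axis=-1)
    return kl.mean()
```

In an actual training loop the student logits would carry gradients and the teacher pass would be a second forward pass of the same network with the privileged context prepended; here both are plain arrays to show the loss shape only.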
Problem

Research questions and friction points this paper is trying to address.

knowledge distillation
distribution mismatch
reasoning datasets
teacher-student framework
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

On-Policy Self-Distillation
Knowledge Distillation
Large Language Models
Reasoning Trajectories
Token Efficiency