Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

📅 2026-04-14

🤖 AI Summary
This study examines why on-policy distillation (OPD) of large language models sometimes fails, given that its training dynamics remain poorly understood. It identifies two conditions that govern whether OPD succeeds: the teacher and student must share compatible thinking patterns, and the teacher must offer genuinely new capabilities beyond what the student has seen during training. Weak-to-strong reverse distillation supports these findings, showing that same-family 1.5B and 7B teachers are distributionally indistinguishable from the student's perspective. At the token level, successful OPD is characterized by progressive alignment on high-probability tokens at student-visited states, where a small shared token set carries 97%-99% of the probability mass. The authors propose two strategies to recover failing OPD, off-policy cold start and teacher-aligned prompt selection, and note that OPD's dense token-level reward comes at a cost, which raises the question of whether it can scale to long-horizon distillation.
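To make the 97%-99% figure concrete, here is a minimal sketch of how one might measure the probability mass that falls on tokens shared by the student's and teacher's top-k sets at a single position. The function names, the choice of top-k as the definition of "shared token set", and the toy distributions are illustrative assumptions, not the paper's measurement protocol.

```python
def top_k_ids(probs, k):
    """Return the indices of the k largest probabilities as a set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(order[:k])


def shared_top_k_mass(student_probs, teacher_probs, k):
    """Hypothetical probe: overlap of top-k sets and the mass each model puts on it.

    Returns (overlap_ratio, student_mass, teacher_mass) for one position.
    """
    shared = top_k_ids(student_probs, k) & top_k_ids(teacher_probs, k)
    overlap_ratio = len(shared) / k
    student_mass = sum(student_probs[i] for i in shared)
    teacher_mass = sum(teacher_probs[i] for i in shared)
    return overlap_ratio, student_mass, teacher_mass


# Toy next-token distributions over a 5-token vocabulary (made-up numbers).
student = [0.5, 0.3, 0.1, 0.05, 0.05]
teacher = [0.4, 0.4, 0.05, 0.1, 0.05]
overlap, s_mass, t_mass = shared_top_k_mass(student, teacher, k=2)
```

In practice such a probe would be averaged over many student-visited states, and tracking it across training steps is one way to observe the progressive alignment the summary describes.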

📝 Abstract
On-policy distillation (OPD) has become a core technique in the post-training of large language models, yet its training dynamics remain poorly understood. This paper provides a systematic investigation of OPD dynamics and mechanisms. We first identify that two conditions govern whether OPD succeeds or fails: (i) the student and teacher should share compatible thinking patterns; and (ii) even with consistent thinking patterns and higher scores, the teacher must offer genuinely new capabilities beyond what the student has seen during training. We validate these findings through weak-to-strong reverse distillation, showing that same-family 1.5B and 7B teachers are distributionally indistinguishable from the student's perspective. Probing into the token-level mechanism, we show that successful OPD is characterized by progressive alignment on high-probability tokens at student-visited states, a small shared token set that concentrates most of the probability mass (97%-99%). We further propose two practical strategies to recover failing OPD: off-policy cold start and teacher-aligned prompt selection. Finally, we show that OPD's apparent free lunch of dense token-level reward comes at a cost, raising the question of whether OPD can scale to long-horizon distillation.
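The "dense token-level reward" can be illustrated with a small sketch. The abstract does not state the training objective, so the code below assumes one common formulation of OPD: the student samples a sequence, and at every position the loss is the reverse KL divergence from the student's next-token distribution to the teacher's. The function names and the toy logits are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def per_token_reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at each position of a student-sampled sequence.

    Both arguments are lists of per-position logit lists, scored on the
    same student-generated prefix (an assumption of this sketch).
    """
    losses = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        s_logp = log_softmax(s_row)
        t_logp = log_softmax(t_row)
        kl = sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s_logp, t_logp))
        losses.append(kl)
    return losses


# Two positions, 3-token vocabulary (made-up logits).
student_logits = [[2.0, 0.5, 0.1], [0.0, 0.0, 0.0]]
teacher_logits = [[2.0, 0.5, 0.1], [3.0, 0.0, 0.0]]
token_losses = per_token_reverse_kl(student_logits, teacher_logits)
```

Every position yields its own signal, unlike a single sequence-level reward. Because the states are sampled from the student, the teacher is only ever queried on prefixes the student actually visits, which is the setting the abstract's "student-visited states" refers to.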
Problem

Research questions and friction points this paper is trying to address.

On-Policy Distillation
Training Dynamics
Large Language Models
Knowledge Distillation
Token-level Alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

on-policy distillation
thinking pattern compatibility
token-level alignment
off-policy cold start
teacher-aligned prompt selection