Stable On-Policy Distillation through Adaptive Target Reformulation

📅 2026-01-12
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the mismatch between training and inference distributions in conventional knowledge distillation, as well as the training instability that arises in on-policy distillation when the gap between the teacher and student models is too wide to bridge directly. To this end, the authors propose Veto, a method that constructs a geometric bridge in logit space to generate adaptive intermediate target distributions. Veto introduces a tunable parameter β that enables adaptive gradient suppression and control over output diversity. By combining target reformulation, forward and reverse KL divergence optimization, adaptive gradient suppression, and a diversity regulation mechanism, Veto substantially improves training stability and generation quality. Empirical results demonstrate consistent gains over supervised fine-tuning and existing on-policy distillation baselines across diverse reasoning and generation tasks.
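One way to read the "geometric bridge in logit space" is as an interpolation of teacher and student logits before normalization, which corresponds to a renormalized geometric mean of the two distributions. A minimal sketch under that assumption (the exact construction in the paper may differ; `bridge_target` and the example logits are illustrative):

```python
import numpy as np

def bridge_target(teacher_logits, student_logits, beta):
    """Intermediate target: softmax of a convex combination of logits.

    In logit space this is a geometric interpolation of the two
    distributions; beta=1 recovers the teacher, beta=0 the student.
    (A hypothetical reading of the paper's construction.)
    """
    mixed = beta * teacher_logits + (1.0 - beta) * student_logits
    mixed = mixed - mixed.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(mixed)
    return probs / probs.sum(axis=-1, keepdims=True)

teacher = np.array([2.0, 0.5, -1.0])
student = np.array([-1.0, 1.5, 0.5])
target = bridge_target(teacher, student, beta=0.7)
```

Because the interpolation happens before the softmax, intermediate values of β yield a target that is sharper than the student but less peaked than the teacher, which is one plausible mechanism for the "Decisiveness Knob" behavior the summary describes.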

📝 Abstract
Knowledge distillation (KD) is a widely adopted technique for transferring knowledge from large language models to smaller student models; however, conventional supervised KD often suffers from a distribution mismatch between training and inference. While on-policy KD approaches attempt to mitigate this issue by learning directly from student-generated outputs, they frequently encounter training instabilities because the distributional gap between the novice student and the expert teacher is often too wide to bridge directly. These challenges manifest as pathological gradients in forward KL objectives or diversity collapse in reverse KL regimes. To address these limitations, we propose Veto, an objective-level reformulation that constructs a geometric bridge in the logit space. Unlike prior methods that mix data samples, Veto creates an intermediate target distribution that promotes alignment between the teacher and the student. By introducing a tunable parameter beta, Veto serves as an Adaptive Gradient Veto that stabilizes optimization by suppressing harmful gradients on low-confidence tokens, while simultaneously acting as a Decisiveness Knob to balance reward-driven performance with output diversity. Extensive experiments across various reasoning and generation tasks demonstrate that Veto consistently outperforms supervised fine-tuning and existing on-policy baselines.
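The objective described in the abstract can be sketched as a token-wise forward KL against an intermediate target built in logit space, with a hard confidence gate standing in for the "Adaptive Gradient Veto". Everything here is an assumption for illustration: the threshold `tau`, the use of teacher confidence as the gate, and the name `veto_style_loss` are not from the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def veto_style_loss(teacher_logits, student_logits, beta, tau=0.5):
    """Token-wise forward KL(target || student) against a logit-space
    bridge target, with low-confidence tokens masked out.

    Illustrative sketch only, not the paper's exact objective: `tau`
    is an assumed confidence threshold for the 'gradient veto'.
    """
    # Intermediate target: interpolate logits, then normalize.
    target = softmax(beta * teacher_logits + (1.0 - beta) * student_logits)
    student = softmax(student_logits)
    # Forward KL per token (sum over the vocabulary axis).
    kl = (target * (np.log(target + 1e-12) - np.log(student + 1e-12))).sum(-1)
    # Veto: zero out tokens where the teacher itself is uncertain.
    confidence = softmax(teacher_logits).max(-1)
    veto = (confidence >= tau).astype(float)
    return (veto * kl).sum() / max(veto.sum(), 1.0)
```

With beta=0 the target collapses onto the student and the loss vanishes; raising beta pulls the target toward the teacher, while the gate suppresses gradients on tokens where the teacher distribution is too flat to provide a reliable signal.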
Problem

Research questions and friction points this paper is trying to address.

on-policy distillation
distribution mismatch
training instability
knowledge distillation
gradient pathology
Innovation

Methods, ideas, or system contributions that make the work stand out.

on-policy distillation
adaptive target reformulation
gradient veto
logit space bridging
output diversity
Ijun Jang
Graduate School of Data Science, Seoul National University
J. Yeom
Graduate School of Data Science, Seoul National University
Juan Yeo
Graduate School of Data Science, Seoul National University
Hyunggu Lim
Graduate School of Data Science, Seoul National University
Taesup Kim
Assistant Professor, Seoul National University
Representation Learning · Transfer Learning · AI · Machine Learning · Deep Learning