Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling

📅 2024-10-15
🏛️ arXiv.org
📈 Citations: 1
✨ Influential: 0
🤖 AI Summary
In knowledge distillation, supervised methods suffer from a train-inference distribution mismatch, while on-policy approaches yield inaccurate teacher feedback due to low-quality student-generated samples. This paper proposes Speculative Knowledge Distillation (SKD), a novel framework in which the student first generates candidate token sequences and the teacher dynamically corrects only low-confidence tokens, enabling high-fidelity knowledge transfer while staying aligned with the student's inference-time distribution. Its core innovation is an online, token-level, teacher-student collaborative correction mechanism that integrates confidence-driven interleaved sampling, teacher-guided dynamic reweighting, and multi-task joint training. Evaluated on machine translation, summarization, mathematical reasoning, and instruction-following tasks, the method consistently outperforms both supervised and on-policy distillation baselines, with robust gains across diverse model scales, data regimes, and initialization strategies.

๐Ÿ“ Abstract
Recent advances in knowledge distillation (KD) have enabled smaller student models to approach the performance of larger teacher models. However, popular methods such as supervised KD and on-policy KD are adversely impacted by the knowledge gap between teacher and student in practical scenarios. Supervised KD suffers from a distribution mismatch between training on a static dataset and inference over student-generated outputs. Conversely, on-policy KD, which trains on student-generated samples, can suffer from low-quality training examples that teacher models are unfamiliar with, resulting in inaccurate teacher feedback. To address these limitations, we introduce Speculative Knowledge Distillation (SKD), a novel approach that leverages cooperation between student and teacher models to generate high-quality training data on the fly while aligning with the student's inference-time distribution. In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution, transferring high-quality knowledge adaptively. We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following, and show that SKD consistently outperforms existing KD methods across different domains, data sizes, and model initialization strategies.
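The interleaved sampling loop described in the abstract can be sketched as follows. This is a minimal toy sketch, not the paper's implementation: `student_probs` and `teacher_probs` are hypothetical stand-ins for real model forward passes, and a simple top-k membership check is assumed as a proxy for the paper's "poorly ranked" acceptance criterion.

```python
import numpy as np

def skd_interleaved_sample(student_probs, teacher_probs, seq_len, top_k=3, seed=0):
    """Toy sketch of SKD-style interleaved sampling.

    student_probs / teacher_probs: callables mapping the token prefix
    (a list of ints) to a next-token probability distribution (1-D array).
    At each step the student proposes a token; if it falls outside the
    teacher's top-k, the teacher replaces it by sampling from its own
    distribution (assumed acceptance rule, for illustration only).
    """
    rng = np.random.default_rng(seed)
    tokens = []
    for _ in range(seq_len):
        p_student = student_probs(tokens)
        proposal = int(rng.choice(len(p_student), p=p_student))
        p_teacher = teacher_probs(tokens)
        teacher_top_k = np.argsort(p_teacher)[::-1][:top_k]
        if proposal in teacher_top_k:
            # Student token is acceptable to the teacher: keep it.
            tokens.append(proposal)
        else:
            # Poorly ranked proposal: teacher substitutes its own sample.
            tokens.append(int(rng.choice(len(p_teacher), p=p_teacher)))
    return tokens
```

In an actual system the callables would be the student and teacher language models, and the accepted/corrected sequence would then serve as the training example on which the teacher's token-level distributions supervise the student.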
Problem

Research questions and friction points this paper is trying to address.

Addresses teacher-student knowledge gap in distillation.
Improves training data quality in knowledge distillation.
Enhances student model performance across diverse tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

SKD bridges teacher-student knowledge gaps adaptively.
Student proposes tokens; teacher refines poorly ranked ones.
SKD outperforms existing KD methods across multiple tasks.