🤖 AI Summary
Traditional knowledge distillation applies the distillation loss uniformly across all tokens, ignoring the teacher model's varying confidence and thereby propagating high-entropy, uncertain predictions as supervision signals, which introduces noise and degrades student performance. To address this, the paper proposes Speculative Knowledge Distillation (SpecKD), a plug-and-play framework with a dynamic, token-level gating mechanism inspired by the "propose-and-verify" paradigm of speculative decoding: at each step, the student's token proposal is verified against the teacher's output distribution, the KL divergence loss is applied only to accepted tokens, and rejected, low-confidence tokens are masked out. Experiments across diverse text generation tasks show that SpecKD consistently outperforms strong distillation baselines, improving student accuracy, robustness, and training stability, and achieving state-of-the-art results.
📝 Abstract
Knowledge Distillation (KD) has become a cornerstone technique for compressing Large Language Models (LLMs) into smaller, more efficient student models. However, conventional KD approaches typically apply the distillation loss uniformly across all tokens, regardless of the teacher's confidence. This indiscriminate mimicry can introduce noise, as the student is forced to learn from the teacher's uncertain or high-entropy predictions, which may ultimately harm student performance, especially when the teacher is much larger and more powerful. To address this, we propose Speculative Knowledge Distillation (SpecKD), a novel, plug-and-play framework that introduces a dynamic, token-level gating mechanism inspired by the "propose-and-verify" paradigm of speculative decoding. At each step, the student's token proposal is verified against the teacher's distribution; the distillation loss is selectively applied only to "accepted" tokens, while "rejected" tokens are masked out. Extensive experiments on diverse text generation tasks show that SpecKD consistently and significantly outperforms strong KD baselines, leading to more stable training and more capable student models, and achieving state-of-the-art results.
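To make the gating idea concrete, here is a minimal NumPy sketch of a token-gated distillation loss. It is an illustration under assumptions, not the paper's implementation: the acceptance rule used here (accept a token when the teacher assigns at least a threshold probability `tau` to the student's argmax proposal) is a hypothetical stand-in for the paper's verify step, and the function name `gated_kd_loss` is invented for this example.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax over the vocabulary axis."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gated_kd_loss(student_logits, teacher_logits, tau=0.5):
    """Hypothetical sketch of token-gated distillation.

    A token position is "accepted" when the teacher assigns at least
    `tau` probability to the student's proposed (argmax) token, loosely
    mirroring the verify step of speculative decoding. The KL(teacher ||
    student) loss is then averaged over accepted positions only;
    rejected positions are masked out of the loss entirely.
    """
    p_t = softmax(teacher_logits)  # (seq_len, vocab)
    p_s = softmax(student_logits)  # (seq_len, vocab)

    # Propose: the student's most likely token at each position.
    proposals = p_s.argmax(axis=-1)

    # Verify: teacher's probability mass on each proposed token.
    teacher_conf = p_t[np.arange(len(proposals)), proposals]
    accept_mask = teacher_conf >= tau

    # Per-token KL(teacher || student), with a small epsilon for stability.
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(axis=-1)

    if accept_mask.sum() == 0:
        return 0.0, accept_mask
    return float((kl * accept_mask).sum() / accept_mask.sum()), accept_mask
```

In the sketch, a position where the student and teacher agree sharply is accepted and distilled, while a position where the teacher is diffuse (high-entropy) over the student's proposal is rejected and contributes no gradient, which is the noise-filtering behavior the abstract describes.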