Delta Knowledge Distillation for Large Language Models

📅 2025-09-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Existing knowledge distillation methods for large language models implicitly assume that the teacher and student share the same optimal representation space, an assumption that may not hold once the teacher has undergone supervised fine-tuning (SFT). Method: Delta Knowledge Distillation (Delta-KD) explicitly models the output distribution shift (Delta) that SFT induces in the teacher, relaxing the shared-representation assumption. It uses token-level KL divergence distillation to approximate this Delta and transfer it to the student, so the student inherits how the teacher's distribution changed rather than only where it ended up. Contribution/Results: On multiple benchmark tasks, Delta-KD improves ROUGE scores over standard distillation baselines, supporting the claim that explicitly modeling the distributional shift makes knowledge transfer in LLM distillation more effective.

📝 Abstract

Knowledge distillation (KD) is a widely adopted approach for compressing large neural networks by transferring knowledge from a large teacher model to a smaller student model. In the context of large language models, token-level KD, which typically minimizes the KL divergence between the student's and the teacher's output distributions, has shown strong empirical performance. However, prior work assumes that the student's and teacher's output distributions share the same optimal representation space, a premise that may not hold in many cases. To address this, we propose Delta Knowledge Distillation (Delta-KD), a novel extension of token-level KD that encourages the student to approximate an optimal representation space by explicitly preserving the distributional shift Delta introduced during the teacher's supervised fine-tuning (SFT). Empirical results on ROUGE metrics demonstrate that Delta-KD substantially improves student performance while preserving more of the teacher's knowledge.
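
The abstract does not give a formula for Delta, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes Delta is the per-token log-probability shift between the teacher before and after SFT, and that this shift is applied to the student's own pre-distillation distribution to build the distillation target. The function names, the `alpha` scaling factor, and the forward KL direction are all hypothetical choices made for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def softmax(logits):
    """Convert logits to a probability distribution."""
    return [math.exp(lp) for lp in log_softmax(logits)]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def token_level_kd_loss(teacher_logits, student_logits):
    """Standard token-level KD: mean over positions of KL(teacher || student)."""
    losses = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(losses) / len(losses)


def delta_shifted_target(student_base, teacher_base, teacher_sft, alpha=1.0):
    """ASSUMPTION (not from the paper): build a target for one token position
    by adding the teacher's SFT-induced log-probability shift to the student's
    base log-probabilities, then renormalizing."""
    s = log_softmax(student_base)
    tb = log_softmax(teacher_base)
    ts = log_softmax(teacher_sft)
    shifted = [si + alpha * (tsi - tbi) for si, tbi, tsi in zip(s, tb, ts)]
    return softmax(shifted)


def delta_kd_loss(student_logits, student_base, teacher_base, teacher_sft, alpha=1.0):
    """Hypothetical Delta-style loss: mean over positions of
    KL(delta-shifted target || current student distribution)."""
    losses = []
    for s, sb, tb, ts in zip(student_logits, student_base, teacher_base, teacher_sft):
        target = delta_shifted_target(sb, tb, ts, alpha)
        losses.append(kl_divergence(target, softmax(s)))
    return sum(losses) / len(losses)
```

Each argument is a list of per-position logit lists, all over the same vocabulary. Under this reading, two limiting cases follow directly from the arithmetic: if SFT leaves the teacher's distribution unchanged, the target is simply the student's base distribution, and if the student's base distribution equals the teacher's base distribution (with `alpha=1.0`), the target equals the fine-tuned teacher's distribution, which recovers standard token-level KD.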

Problem

Research questions and friction points this paper is trying to address.

Addresses suboptimal student representation space in knowledge distillation

Proposes Delta-KD to preserve teacher's distributional shift from SFT

Improves student performance while retaining more teacher knowledge

Innovation

Methods, ideas, or system contributions that make the work stand out.

Delta-KD preserves distributional shift from teacher

Extends token-level knowledge distillation for language models

Improves student performance by approximating optimal representation
