Warmup-Distill: Bridge the Distribution Mismatch between Teacher and Student before Knowledge Distillation

📅 2025-02-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
In knowledge distillation for large language models (LLMs), performance degradation arises from misalignment between the teacher and student output distributions. Method: This paper proposes a "pre-distillation alignment" mechanism that, prior to formal distillation, calibrates the student's low-probability predictions using the teacher model—via probability reweighting, internal knowledge self-checking, and lightweight distribution expansion—to achieve preliminary alignment in representation space. Contribution/Results: This work introduces a "warm-up-style distribution alignment" paradigm that proactively rectifies the student's output distribution before distillation begins, mitigating mode averaging and mode collapse and improving the quality of the distillation starting point. Experiments on seven benchmarks show an average gain of at least 0.4 points; on the mathematical reasoning task, accuracy improves by up to 1.9%. The method enhances both the generalization capability and the distillation efficiency of smaller student models.

📝 Abstract
The widespread deployment of Large Language Models (LLMs) is hindered by their high computational demands, making knowledge distillation (KD) crucial for developing compact student models. However, conventional KD methods suffer from a distribution mismatch between the teacher and student models, leading to poor distillation performance. For instance, the widely used KL-based methods suffer from the mode-averaging and mode-collapsing problems due to the mismatched probability distributions of the two models. Previous studies mainly address this issue via different distance calculations over the distributions of both models. Unfortunately, the distribution mismatch still exists in the early stage of distillation. Hence, to reduce the impact of distribution mismatch, we propose a simple yet efficient method, named Warmup-Distill, which aligns the distribution of the student to that of the teacher in advance of distillation. Specifically, we first detect the distribution of the student model in practical scenarios through its internal knowledge, and then modify the knowledge with low probability using the teacher as the checker. Consequently, Warmup-Distill aligns the student's internal knowledge to that of the teacher, which expands the student's distribution with the teacher's and helps the student model learn better in the subsequent distillation. Experiments on seven benchmarks demonstrate that Warmup-Distill provides a warmed-up student more suitable for distillation, outperforming the vanilla student by at least +0.4 averaged score across all benchmarks. Notably, with the assistance of Warmup-Distill, distillation on the math task yields a further improvement of up to +1.9% accuracy.
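The core warm-up step described above—keeping the student's confident predictions but replacing its low-probability mass with the teacher's—can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's implementation: the function name `warmup_align`, the fixed probability threshold, and the direct replace-and-renormalize rule are all hypothetical simplifications of the "teacher as checker" idea.

```python
import numpy as np

def warmup_align(student_probs, teacher_probs, threshold=0.05):
    """Hypothetical sketch of warm-up alignment: where the student assigns
    probability below `threshold`, substitute the teacher's probability
    (the teacher acts as a checker), then renormalize to a distribution."""
    aligned = np.where(student_probs < threshold, teacher_probs, student_probs)
    return aligned / aligned.sum(axis=-1, keepdims=True)

# Toy vocabulary of 4 tokens: the student concentrates mass on two tokens,
# while the teacher spreads probability more broadly.
student = np.array([0.70, 0.25, 0.03, 0.02])
teacher = np.array([0.40, 0.30, 0.20, 0.10])
print(warmup_align(student, teacher))  # low-probability tail is expanded toward the teacher
```

After alignment, the student's near-zero tail tokens inherit the teacher's mass, which is one way to read the abstract's claim that the method "expands the distribution of the student with the teacher's" before KL-based distillation begins.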
Problem

Research questions and friction points this paper is trying to address.

Addresses distribution mismatch in knowledge distillation.
Improves student model performance pre-distillation.
Aligns student knowledge with teacher model.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Warmup-Distill aligns teacher-student distributions
Modifies student's low probability knowledge
Expands student distribution with teacher's