🤖 AI Summary
Conventional knowledge distillation treats all tokens uniformly, even though tokens contribute unequally to model decisions. The authors propose an adaptive token-level distillation strategy that uses the entropy of the teacher's output to guide three aspects of training. An entropy-driven curriculum shifts focus from low- to high-entropy tokens as training progresses; the distillation temperature is adjusted per token according to its entropy; and a dual-branch architecture applies logits-only distillation to easy tokens and deeper feature-based distillation to difficult ones. The authors report that extensive experiments validate the soundness and effectiveness of the method.
📝 Abstract
Large language models (LLMs) have achieved remarkable performance across diverse domains, yet their enormous computational and memory requirements hinder deployment in resource-constrained environments. Knowledge distillation offers a promising solution by transferring knowledge from a large teacher model to a smaller student model. However, existing distillation methods typically treat all tokens equally, ignoring the fact that different tokens contribute unequally to model decisions. This can lead to inefficient knowledge transfer and reduced learning effectiveness. To address this limitation, we propose an entropy-based adaptive distillation strategy that dynamically adjusts the training process at the token level. Our method leverages the teacher's output entropy to guide three aspects of distillation. Specifically, we introduce a token-level curriculum by dynamically shifting focus from low- to high-entropy tokens during training. We further adjust the distillation temperature based on token entropy to better capture teacher confidence patterns. Moreover, we employ a dual-branch architecture for efficient logits-only distillation on easy tokens and deeper feature-based distillation on difficult tokens. Extensive experiments validate the soundness and effectiveness of our method.
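The three entropy-guided components described in the abstract can be illustrated with a minimal per-token sketch. This is not the authors' implementation: the abstract gives no formulas, so every function name and constant here is hypothetical. In particular, the linear entropy-to-temperature map (and its direction), the Gaussian-shaped curriculum weight, and the fixed routing threshold are assumptions chosen only to make the idea concrete.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over one token's logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def normalized_entropy(teacher_logits):
    """Shannon entropy of the teacher distribution, divided by log(V) so it lies in [0, 1]."""
    probs = softmax(teacher_logits)
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(teacher_logits))

def token_temperature(entropy, t_min=1.0, t_max=4.0):
    """ASSUMPTION: temperature grows linearly with normalized entropy."""
    return t_min + (t_max - t_min) * entropy

def curriculum_weight(entropy, progress, sharpness=5.0):
    """ASSUMPTION: a Gaussian-shaped weight whose peak moves from entropy 0
    (progress = 0) to entropy 1 (progress = 1), so the focus shifts from
    low- to high-entropy tokens as training advances."""
    return math.exp(-sharpness * (entropy - progress) ** 2)

def use_feature_branch(entropy, threshold=0.5):
    """ASSUMPTION: tokens above a fixed entropy threshold count as 'difficult'
    and would be routed to the feature-based branch (not implemented here)."""
    return entropy > threshold

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def token_distillation_loss(teacher_logits, student_logits, progress):
    """Curriculum-weighted logits distillation loss for a single token."""
    h = normalized_entropy(teacher_logits)
    t = token_temperature(h)
    p = softmax(teacher_logits, t)
    q = softmax(student_logits, t)
    # The t**2 factor is the usual gradient rescaling for temperature-softened KL.
    return curriculum_weight(h, progress) * (t ** 2) * kl_divergence(p, q)

if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.5, 0.1]
    student = [1.5, 1.2, 0.3, 0.0]
    h = normalized_entropy(teacher)
    print("normalized entropy:", h)
    print("temperature:", token_temperature(h))
    print("feature branch:", use_feature_branch(h))
    print("loss early in training:", token_distillation_loss(teacher, student, 0.1))
    print("loss late in training:", token_distillation_loss(teacher, student, 0.9))
```

In a real training loop these quantities would be computed in batch over all token positions, and the feature branch would add a loss on intermediate hidden states for the tokens it selects; the sketch covers only the logits branch.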