🤖 AI Summary
Deploying large language models (LLMs) in resource-constrained settings remains challenging, and knowledge distillation suffers from inefficient knowledge transfer due to the substantial capacity gap between teacher and student models.
Method: This paper proposes Temporally Adaptive Interpolated Distillation (TAID), a novel distillation framework featuring a probability-distribution-based dynamic interpolation mechanism that constructs tunable intermediate distributions to jointly optimize mode diversity and knowledge fidelity, with a theoretical analysis showing it prevents mode collapse. Technically, TAID combines adaptive KL-divergence optimization, progressive distribution alignment, and dynamic soft-label generation.
Contribution/Results: TAID consistently outperforms state-of-the-art distillation methods on both instruction-tuning and pretraining tasks. We publicly release two efficient foundation models—TAID-LLM-1.5B and TAID-VLM-2B—establishing a new paradigm for lightweight, multimodal LLM deployment.
📝 Abstract
Causal language models have demonstrated remarkable capabilities, but their size poses significant challenges for deployment in resource-constrained environments. Knowledge distillation, a widely-used technique for transferring knowledge from a large teacher model to a small student model, presents a promising approach for model compression. A significant remaining issue lies in the major differences between teacher and student models, namely the substantial capacity gap, mode averaging, and mode collapse, which pose barriers during distillation. To address these issues, we introduce $\textit{Temporally Adaptive Interpolated Distillation (TAID)}$, a novel knowledge distillation approach that dynamically interpolates student and teacher distributions through an adaptive intermediate distribution, gradually shifting from the student's initial distribution towards the teacher's distribution. We provide a theoretical analysis demonstrating TAID's ability to prevent mode collapse and empirically show its effectiveness in addressing the capacity gap while balancing mode averaging and mode collapse. Our comprehensive experiments demonstrate TAID's superior performance across various model sizes and architectures in both instruction tuning and pre-training scenarios. Furthermore, we showcase TAID's practical impact by developing two state-of-the-art compact foundation models: $\texttt{TAID-LLM-1.5B}$ for language tasks and $\texttt{TAID-VLM-2B}$ for vision-language tasks. These results demonstrate TAID's effectiveness in creating high-performing and efficient models, advancing the development of more accessible AI technologies.
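To make the core idea concrete, here is a minimal toy sketch of the intermediate-distribution construction the abstract describes. It assumes a simple linear interpolation in probability space between the student's and teacher's output distributions, with the interpolation parameter `t` increasing over training; the paper's actual formulation (including how `t` is adapted) may differ, and all function names here are illustrative, not from the released code.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def taid_intermediate(student_probs, teacher_probs, t):
    """Toy intermediate target: blend of the (frozen) student and teacher
    distributions, controlled by t in [0, 1]. At t=0 it equals the student's
    own distribution; at t=1 it equals the teacher's."""
    return (1.0 - t) * student_probs + t * teacher_probs

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) with a small epsilon for numerical safety."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

# Toy distributions over a 4-token vocabulary.
student = softmax(np.array([2.0, 0.5, 0.1, -1.0]))
teacher = softmax(np.array([0.2, 2.5, 1.0, 0.3]))

# As t grows over training, the target shifts away from the student's
# initial distribution towards the teacher's, so the student always
# chases a target within reach of its current capacity.
for t in (0.0, 0.5, 1.0):
    target = taid_intermediate(student, teacher, t)
    print(f"t={t}: KL(target || student) = {kl_divergence(target, student):.4f}")
```

The distillation loss at each step would then be the KL divergence between this moving intermediate target and the student's current distribution, which is what lets the method bridge a large capacity gap gradually rather than forcing the student to match the teacher directly from the start.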