🤖 AI Summary
Existing LLM knowledge distillation methods apply uniform loss functions to all teacher- and student-generated data, neglecting the intrinsic alignment between loss design and heterogeneous data types, such as instructions, code, preferences, and multimodal inputs, thereby limiting performance gains. This work proposes a data-type-aware contrastive distillation framework. It introduces (i) a novel bidirectional contrastive loss that explicitly distinguishes teacher and student responses; (ii) the first dynamic coupling mechanism between loss functions and data types; and (iii) integrated techniques including hierarchical response modeling, task-adaptive weighting, and cross-modal extension. Evaluated on instruction-following and code-generation benchmarks, the method significantly outperforms state-of-the-art distillation approaches. It further supports preference alignment and vision-language joint distillation, enabling compact student models to retain over 92% of teacher-model capability across diverse tasks.
📝 Abstract
Despite the success of distillation in large language models (LLMs), most prior work applies identical loss functions to both teacher- and student-generated data. These strategies overlook the synergy between loss formulations and data types, leading to a suboptimal performance boost in student models. To address this, we propose DistiLLM-2, a contrastive approach that simultaneously increases the likelihood of teacher responses and decreases that of student responses by harnessing this synergy. Our extensive experiments show that DistiLLM-2 not only builds high-performing student models across a wide range of tasks, including instruction-following and code generation, but also supports diverse applications, such as preference alignment and vision-language extensions. These findings highlight the potential of a contrastive approach to enhance the efficacy of LLM distillation by effectively aligning teacher and student models across varied data types.
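To make the core idea concrete, here is a minimal sketch of a contrastive distillation objective in plain Python. This is an illustrative simplification, not DistiLLM-2's actual loss (the paper's formulation is not given here): it only captures the two directions the abstract describes, pushing the student's likelihood up on teacher-generated responses and down on the student's own responses. The function name, the per-token log-probability inputs, and the `alpha`/`beta` weights are all hypothetical.

```python
import math

def contrastive_distill_loss(logp_on_teacher_resp, logp_on_student_resp,
                             alpha=1.0, beta=1.0):
    """Illustrative contrastive distillation loss (NOT the paper's exact loss).

    Both arguments are the *student* model's per-token log-probabilities
    (strictly negative floats): the first evaluated on teacher-generated
    responses, the second on student-generated responses.
    """
    # Pull term: mean negative log-likelihood on teacher responses.
    # Minimizing it raises the student's likelihood of teacher outputs.
    pull = -sum(logp_on_teacher_resp) / len(logp_on_teacher_resp)

    # Push term: mean -log(1 - p) on the student's own responses.
    # Minimizing it lowers the likelihood of student outputs while
    # keeping the loss bounded (unlike maximizing raw NLL).
    push = -sum(math.log1p(-math.exp(lp)) for lp in logp_on_student_resp)
    push /= len(logp_on_student_resp)

    return alpha * pull + beta * push
```

Treating the two data sources asymmetrically is the point: a single symmetric loss applied to both would, per the abstract, ignore the synergy between loss formulation and data type.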