🤖 AI Summary
Existing LLM knowledge distillation methods apply uniform loss functions to all teacher- and student-generated data, neglecting the intrinsic alignment between loss design and heterogeneous data types, such as instructions, code, preferences, and multimodal inputs, thereby limiting performance gains. This work proposes a data-type-aware contrastive distillation framework. It introduces (i) a novel bidirectional contrastive loss that explicitly distinguishes teacher and student responses; (ii) the first dynamic coupling mechanism between loss functions and data types; and (iii) integrated techniques including hierarchical response modeling, task-adaptive weighting, and cross-modal extension. Evaluated on instruction-following and code-generation benchmarks, the method significantly outperforms state-of-the-art distillation approaches. It further supports preference alignment and vision-language joint distillation, enabling compact student models to retain over 92% of teacher-model capability across diverse tasks.
📝 Abstract
Despite the success of distillation in large language models (LLMs), most prior work applies identical loss functions to both teacher- and student-generated data. These strategies overlook the synergy between loss formulations and data types, leading to a suboptimal performance boost in student models. To address this, we propose DistiLLM-2, a contrastive approach that simultaneously increases the likelihood of teacher responses and decreases that of student responses by harnessing this synergy. Our extensive experiments show that DistiLLM-2 not only builds high-performing student models across a wide range of tasks, including instruction-following and code generation, but also supports diverse applications, such as preference alignment and vision-language extensions. These findings highlight the potential of a contrastive approach to enhance the efficacy of LLM distillation by effectively aligning teacher and student models across varied data types.
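To make the core idea concrete, here is a minimal sketch of a contrastive distillation objective in plain Python. This is an illustrative simplification, not DistiLLM-2's actual loss (the paper's formulation is not given here): it only captures the two directions the abstract describes, pushing the student's likelihood up on teacher-generated responses and down on the student's own responses. The function name, the per-token log-probability inputs, and the `alpha`/`beta` weights are all hypothetical.

```python
import math

def contrastive_distill_loss(logp_on_teacher_resp, logp_on_student_resp,
                             alpha=1.0, beta=1.0):
    """Illustrative contrastive distillation loss (NOT the paper's exact loss).

    Both arguments are the *student* model's per-token log-probabilities
    (strictly negative floats): the first evaluated on teacher-generated
    responses, the second on student-generated responses.
    """
    # Pull term: mean negative log-likelihood on teacher responses.
    # Minimizing it raises the student's likelihood of teacher outputs.
    pull = -sum(logp_on_teacher_resp) / len(logp_on_teacher_resp)

    # Push term: mean -log(1 - p) on the student's own responses.
    # Minimizing it lowers the likelihood of student outputs while
    # keeping the loss bounded (unlike maximizing raw NLL).
    push = -sum(math.log1p(-math.exp(lp)) for lp in logp_on_student_resp)
    push /= len(logp_on_student_resp)

    return alpha * pull + beta * push
```

Treating the two data sources asymmetrically is the point: a single symmetric loss applied to both would, per the abstract, ignore the synergy between loss formulation and data type.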