DistiLLM-2: A Contrastive Approach Boosts the Distillation of LLMs

📅 2025-03-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing LLM knowledge distillation methods apply uniform loss functions to all teacher- and student-generated data, neglecting the intrinsic alignment between loss design and heterogeneous data types (such as instructions, code, preferences, and multimodal inputs), thereby limiting performance gains. This work proposes a data-type-aware contrastive distillation framework. It introduces (i) a novel bidirectional contrastive loss that explicitly distinguishes teacher and student responses; (ii) the first dynamic coupling mechanism between loss functions and data types; and (iii) integrated techniques including hierarchical response modeling, task-adaptive weighting, and cross-modal extension. Evaluated on instruction-following and code generation, the method significantly outperforms state-of-the-art distillation approaches. It further supports preference alignment and vision-language joint distillation, enabling compact student models to retain over 92% of teacher model capability across diverse tasks.

📝 Abstract
Despite the success of distillation in large language models (LLMs), most prior work applies identical loss functions to both teacher- and student-generated data. These strategies overlook the synergy between loss formulations and data types, leading to a suboptimal performance boost in student models. To address this, we propose DistiLLM-2, a contrastive approach that simultaneously increases the likelihood of teacher responses and decreases that of student responses by harnessing this synergy. Our extensive experiments show that DistiLLM-2 not only builds high-performing student models across a wide range of tasks, including instruction-following and code generation, but also supports diverse applications, such as preference alignment and vision-language extensions. These findings highlight the potential of a contrastive approach to enhance the efficacy of LLM distillation by effectively aligning teacher and student models across varied data types.
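To make the core idea concrete, below is a minimal PyTorch sketch of the contrastive objective the abstract describes: for the same prompt, the student is trained to raise its likelihood of the teacher's response and lower its likelihood of its own sampled response. The function names (sequence_log_likelihood, contrastive_distill_loss), the logistic-margin form, and the beta scale are illustrative assumptions, not DistiLLM-2's exact formulation.

```python
# Illustrative sketch only: contrast teacher-generated vs. student-generated
# responses so the student prefers the teacher's output over its own.
import torch
import torch.nn.functional as F


def sequence_log_likelihood(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Average per-token log-likelihood of `labels` under `logits`.

    logits: (batch, seq_len, vocab); labels: (batch, seq_len) token ids.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    token_ll = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)  # (batch, seq_len)
    return token_ll.mean(dim=-1)  # (batch,)


def contrastive_distill_loss(
    student_logits_on_teacher_resp: torch.Tensor,
    teacher_resp_ids: torch.Tensor,
    student_logits_on_student_resp: torch.Tensor,
    student_resp_ids: torch.Tensor,
    beta: float = 1.0,
) -> torch.Tensor:
    """Logistic loss on the likelihood gap between the two responses.

    Minimized when the student assigns higher likelihood to the teacher's
    response than to its own response for the same prompt.
    """
    ll_teacher_resp = sequence_log_likelihood(student_logits_on_teacher_resp, teacher_resp_ids)
    ll_student_resp = sequence_log_likelihood(student_logits_on_student_resp, student_resp_ids)
    return -F.logsigmoid(beta * (ll_teacher_resp - ll_student_resp)).mean()
```

In practice the two logits tensors would come from scoring the teacher-generated and student-generated responses with the student model under teacher forcing; DistiLLM-2 additionally ties the loss design to the data type, which this sketch omits.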
Problem

Research questions and friction points this paper is trying to address.

Prior distillation methods apply identical loss functions to both teacher- and student-generated data
Loss formulations are not matched to the data they are applied to, limiting student performance gains
Teacher and student models remain poorly aligned across varied data types
Innovation

Methods, ideas, or system contributions that make the work stand out.

Contrastive loss that raises the likelihood of teacher responses while lowering that of student responses
Exploits the synergy between loss formulations and teacher- and student-generated data
Extends beyond instruction-following and code generation to preference alignment and vision-language distillation
Authors
Jongwoo Ko
Senior Researcher, Microsoft | Ph.D., KAIST AI
Efficient AI, Large Language Models
Tianyi Chen
Microsoft, Redmond, Washington, USA
Sungnyun Kim
KAIST AI
audio-visual multimodal, spoken language, computer vision
Tianyu Ding
University of Pittsburgh
Luming Liang
Microsoft, Redmond, Washington, USA
Ilya Zharkov
Microsoft, Redmond, Washington, USA
Se-Young Yun
KAIST AI, Seoul, Republic of Korea