A Dual-Space Framework for General Knowledge Distillation of Large Language Models

๐Ÿ“… 2025-04-15
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿค– AI Summary
Current white-box knowledge distillation for large language model compression faces two key bottlenecks: (1) misalignment between teacher and student output spaces, limiting effective knowledge transfer, and (2) poor adaptability to heterogeneous vocabularies. To address these, we propose a dual-space distillation framework featuring a unified prediction head, dual projector initialization, mutual hidden-state mapping, andโ€”cruciallyโ€”the novel Exact Token Alignment (ETA) algorithm. This enables vocabulary-agnostic, strategy-independent white-box distillation across disparate tokenizers. Notably, our method is the first to support fine-grained knowledge transfer under arbitrary tokenizer pairings. Extensive experiments on instruction-following, mathematical reasoning, and code generation tasks demonstrate that our approach consistently outperforms existing white-box and cross-tokenizer distillation methods, achieving superior student performance while preserving architectural flexibility.

๐Ÿ“ Abstract
Knowledge distillation (KD) is a promising solution to compress large language models (LLMs) by transferring their knowledge to smaller models. During this process, white-box KD methods usually minimize the distance between the output distributions of the teacher model and the student model to transfer more information. However, we reveal that the current white-box KD framework exhibits two limitations: a) bridging probability distributions from different output spaces will limit the similarity between the teacher model and the student model; b) this framework cannot be applied to LLMs with different vocabularies. One of the root causes for these limitations is that the distributions from the teacher and the student for KD are output by different prediction heads, which yield distributions in different output spaces and dimensions. Therefore, in this paper, we propose a dual-space knowledge distillation (DSKD) framework that unifies the prediction heads of the teacher and the student models for KD. Specifically, we first introduce two projectors with ideal initialization to project the teacher/student hidden states into the student/teacher representation spaces. After this, the hidden states from different models can share the same head and unify the output spaces of the distributions. Furthermore, we develop an exact token alignment (ETA) algorithm to align the same tokens in two differently-tokenized sequences. Based on the above, our DSKD framework is a general KD framework that supports both off-policy and on-policy KD, and KD between any two LLMs regardless of their vocabularies. Extensive experiments on instruction-following, mathematical reasoning, and code generation benchmarks show that DSKD significantly outperforms existing methods based on the current white-box KD framework and surpasses other cross-tokenizer KD methods for LLMs with different vocabularies.
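The unified-head idea from the abstract can be sketched in a few lines: project the student's hidden states into the teacher's representation space, then score both through the same teacher head so the two distributions live in one output space and a KL divergence between them is well-defined. This is a minimal NumPy sketch under assumed dimensions with a random (untrained) projector, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def kl_div(p, q, eps=1e-9):
    # KL(p || q), summed over the vocabulary, averaged over token positions.
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

rng = np.random.default_rng(0)
seq_len, d_student, d_teacher, vocab = 4, 8, 16, 32  # illustrative sizes

h_student = rng.normal(size=(seq_len, d_student))   # student hidden states
h_teacher = rng.normal(size=(seq_len, d_teacher))   # teacher hidden states
W_head_t = rng.normal(size=(d_teacher, vocab))      # teacher prediction head (shared)
proj_s2t = rng.normal(size=(d_student, d_teacher))  # student-to-teacher projector (learned in practice)

# Both distributions now come from the SAME head, so they share one output space.
p_teacher = softmax(h_teacher @ W_head_t)
q_student = softmax((h_student @ proj_s2t) @ W_head_t)

loss = kl_div(p_teacher, q_student)  # distillation signal in the teacher's space
```

The paper describes the symmetric direction as well (a teacher-to-student projector with the student's head); the sketch above shows only one of the two mappings.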
Problem

Research questions and friction points this paper is trying to address.

Bridging distributions from different output spaces limits teacher-student similarity
Current white-box KD cannot be applied to LLMs with different vocabularies
A unified prediction head is needed for effective knowledge transfer
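The first two friction points show up directly in the tensor shapes: when teacher and student keep their own prediction heads, their output distributions have different vocabulary dimensions, so a token-level divergence between them is not even well-defined. A small illustration (the vocabulary sizes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical vocabulary sizes for two unrelated tokenizers.
teacher_logits = rng.normal(size=(5, 32000))  # e.g. a LLaMA-style vocabulary
student_logits = rng.normal(size=(5, 50257))  # e.g. a GPT-2-style vocabulary

# A per-token KL divergence needs both distributions over the same vocabulary;
# with mismatched last dimensions the comparison is undefined.
same_space = teacher_logits.shape[-1] == student_logits.shape[-1]
```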
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-space framework unifies prediction heads
Projectors align hidden states across models
Exact token alignment (ETA) algorithm matches identical tokens across different tokenizations
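The idea behind exact token alignment, pairing the same tokens across two differently tokenized versions of one sequence, can be illustrated with exact span matching. This is a simplified sketch (the function name and matching rule are illustrative assumptions, not the paper's exact ETA algorithm):

```python
def exact_token_alignment(tokens_a, tokens_b):
    """Pair tokens that cover the same character span with the same surface form.

    Simplified sketch of exact-match alignment between two tokenizations
    of the same text; returns (index_in_a, index_in_b) pairs.
    """
    def with_spans(tokens):
        # Annotate each token with its (start, end) character offsets.
        spans, pos = [], 0
        for t in tokens:
            spans.append((pos, pos + len(t), t))
            pos += len(t)
        return spans

    spans_b = {key: j for j, key in enumerate(with_spans(tokens_b))}
    return [(i, spans_b[key]) for i, key in enumerate(with_spans(tokens_a))
            if key in spans_b]

# "unbelievable" tokenized two different ways:
a = ["un", "believ", "able"]
b = ["un", "believable"]
exact_token_alignment(a, b)  # only "un" matches exactly -> [(0, 0)]
```

Positions that find no exact match receive no token-level alignment in this sketch; how the full framework treats them is beyond its scope.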
Xue Zhang
School of Computer Science and Technology, Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Beijing Jiaotong University
Songming Zhang
Beijing Jiaotong University
natural language processing, text generation, machine translation
Yunlong Liang
WeChat
Natural Language Processing (NLP)
Fandong Meng
WeChat AI, Tencent
Machine Translation, Natural Language Processing
Yufeng Chen
School of Computer Science and Technology, Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Beijing Jiaotong University
Jinan Xu
Professor of School of Computer and Information Technology, Beijing Jiaotong University
NLP, Machine Translation, LLM
Jie Zhou
Pattern Recognition Center, WeChat AI, Tencent Inc