Multi-Aspect Knowledge Distillation for Language Model with Low-rank Factorization

📅 2026-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of existing knowledge distillation approaches, which predominantly focus on layer-wise distribution matching while neglecting fine-grained, module-level alignment, hindering the transfer of linguistic knowledge. To overcome this, the authors propose a multi-aspect knowledge distillation framework that jointly models the internal structure of both the self-attention and feed-forward modules during distillation, incorporating low-rank factorization to improve parameter efficiency. By aligning key components of the teacher and student models from multiple aspects, the approach achieves competitive performance against strong baselines under the same parameter budget, and it also performs well when distilling auto-regressive architectures.
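The page includes no code, but the summary's description of joint self-attention and feed-forward alignment maps naturally onto a composite distillation loss. Below is a minimal sketch of what such a multi-aspect objective might look like, assuming PyTorch; the tensor shapes, the `proj` module, and the weights `alpha`/`beta`/`gamma` are illustrative assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def multi_aspect_kd_loss(
    s_attn, t_attn,        # attention probabilities: (batch, heads, seq, seq)
    s_ffn, t_ffn,          # FFN hidden states: (batch, seq, d_s) / (batch, seq, d_t)
    s_logits, t_logits,    # output logits: (batch, seq, vocab)
    proj,                  # learned nn.Linear(d_s, d_t) to bridge a dimension gap
    temperature=2.0,
    alpha=1.0, beta=1.0, gamma=1.0,
):
    # Attention aspect: KL divergence between the student's and the
    # (detached) teacher's attention distributions.
    attn_loss = F.kl_div(
        torch.log(s_attn + 1e-9), t_attn.detach(), reduction="batchmean"
    )

    # Feed-forward aspect: MSE between FFN hidden states after projecting
    # the (smaller) student states into the teacher's space.
    ffn_loss = F.mse_loss(proj(s_ffn), t_ffn.detach())

    # Standard logit distillation with temperature scaling.
    logit_loss = F.kl_div(
        F.log_softmax(s_logits / temperature, dim=-1),
        F.softmax(t_logits.detach() / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    return alpha * attn_loss + beta * ffn_loss + gamma * logit_loss
```

Detaching the teacher tensors keeps gradients flowing only into the student, and the temperature-squared scaling is the usual correction for softened logits.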
📝 Abstract
Knowledge distillation is an effective technique for compressing pre-trained language models. However, existing methods focus only on the knowledge distribution across layers, which may lose fine-grained information during alignment. To address this issue, we introduce the Multi-aspect Knowledge Distillation (MaKD) method, which mimics the self-attention and feed-forward modules in greater depth to capture rich linguistic knowledge from different aspects. Experimental results demonstrate that MaKD achieves competitive performance compared with various strong baselines under the same parameter budget. In addition, our method also performs well in distilling auto-regressive architecture models.
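The title and summary also name low-rank factorization as the efficiency component. As a hedged illustration only (the paper's exact placement of the factorization is not specified on this page), the sketch below shows the standard way to factorize a linear layer's weight with a truncated SVD:

```python
import torch

def low_rank_factorize(weight: torch.Tensor, rank: int):
    """Approximate a (d_out, d_in) weight matrix as a product of two
    rank-`rank` factors via truncated SVD, cutting the parameter count
    from d_out * d_in down to rank * (d_out + d_in)."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (d_out, rank), columns scaled by singular values
    B = Vh[:rank, :]             # (rank, d_in)
    return A, B

# Example: a 768x3072 FFN projection factorized at rank 64 keeps
# 64 * (768 + 3072) = 245,760 parameters instead of 2,359,296.
W = torch.randn(768, 3072)
A, B = low_rank_factorize(W, rank=64)
W_approx = A @ B
```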
Problem

Research questions and friction points this paper is trying to address.

Knowledge Distillation
Language Model Compression
Fine-grained Information
Multi-aspect Knowledge
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-aspect Knowledge Distillation
Low-rank Factorization
Language Model Compression
Self-attention Mimicking
Fine-grained Knowledge Transfer
Zihe Liu
Key Laboratory of Big Data & Artificial Intelligence in Transportation, School of Computer Science and Technology, Beijing Jiaotong University
Yulong Mao
Key Laboratory of Big Data & Artificial Intelligence in Transportation, School of Computer Science and Technology, Beijing Jiaotong University
Jinan Xu
Professor of School of Computer and Information Technology, Beijing Jiaotong University
NLP · Machine Translation · LLM
Xinrui Peng
School of Computer Science and Technology, Beijing Jiaotong University
Kaiyu Huang
Beijing Jiaotong University
natural language processing · computational linguistics