🤖 AI Summary
This work addresses a limitation of existing knowledge distillation approaches, which predominantly focus on layer-wise distribution matching while neglecting fine-grained, module-level alignment, thereby hindering effective transfer of linguistic knowledge. To overcome this, the authors propose a multi-aspect knowledge distillation framework that jointly models the internal structure of both the self-attention and feed-forward modules during distillation, and incorporates low-rank decomposition to improve computational efficiency. By aligning key components of the teacher and student models from multiple aspects, the approach achieves competitive performance against strong baselines under the same parameter budget, and also performs well when compressing auto-regressive architectures.
📝 Abstract
Knowledge distillation is an effective technique for compressing pre-trained language models. However, existing methods focus only on the knowledge distribution across layers, which can lose fine-grained information during alignment. To address this issue, we introduce the Multi-aspect Knowledge Distillation (MaKD) method, which mimics the self-attention and feed-forward modules in greater depth to capture rich linguistic knowledge from multiple aspects. Experimental results demonstrate that MaKD achieves competitive performance compared with various strong baselines under the same parameter budget. In addition, our method also performs well in distilling auto-regressive architecture models.
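The abstract does not give the training objective, but the idea of aligning both the self-attention and feed-forward modules can be sketched as a combined loss. The sketch below is an assumption, not the paper's actual formulation: it uses mean-squared error on attention maps and feed-forward outputs, and the weights `alpha` and `beta` are hypothetical.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two same-shaped arrays."""
    return float(np.mean((a - b) ** 2))

def multi_aspect_distill_loss(t_attn, s_attn, t_ffn, s_ffn,
                              alpha=1.0, beta=1.0):
    """Illustrative multi-aspect distillation loss (NOT the paper's exact loss).

    t_attn / s_attn: teacher / student attention maps, shape (heads, seq, seq).
    t_ffn  / s_ffn : teacher / student feed-forward outputs, shape (seq, hidden),
                     assumed already projected to a shared hidden size.
    alpha / beta   : hypothetical weights for the two alignment terms.
    """
    return alpha * mse(t_attn, s_attn) + beta * mse(t_ffn, s_ffn)

# Toy example with random "teacher" and "student" activations.
rng = np.random.default_rng(0)
t_attn = rng.random((4, 8, 8))
s_attn = rng.random((4, 8, 8))
t_ffn = rng.random((8, 64))
s_ffn = rng.random((8, 64))
loss = multi_aspect_distill_loss(t_attn, s_attn, t_ffn, s_ffn)
```

A perfectly aligned student (identical activations) would drive this loss to zero, which is the intuition behind module-level mimicry.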