🤖 AI Summary
This work addresses a limitation of existing knowledge distillation approaches, which predominantly focus on layer-wise distribution matching while neglecting fine-grained, module-level alignment, thereby hindering effective transfer of linguistic knowledge. To overcome this, the authors propose a multi-aspect knowledge distillation framework that jointly models the internal structure of both the self-attention and feed-forward modules during distillation, and incorporates low-rank decomposition to improve computational efficiency. By aligning key components of the teacher and student models from multiple aspects, the approach achieves competitive performance against strong baselines under the same parameter budget, and also performs well when compressing auto-regressive architectures.
📝 Abstract
Knowledge distillation is an effective technique for compressing pre-trained language models. However, existing methods focus only on the knowledge distribution across layers, which can lose fine-grained information during alignment. To address this issue, we introduce the Multi-aspect Knowledge Distillation (MaKD) method, which mimics the self-attention and feed-forward modules in greater depth to capture rich linguistic knowledge from multiple aspects. Experimental results demonstrate that MaKD achieves competitive performance compared with various strong baselines under the same parameter budget. In addition, our method also performs well in distilling auto-regressive architecture models.
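The abstract does not give the training objective, but the idea of aligning both the self-attention and feed-forward modules can be sketched as a combined loss. The sketch below is an assumption, not the paper's actual formulation: it uses mean-squared error on attention maps and feed-forward outputs, and the weights `alpha` and `beta` are hypothetical.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two same-shaped arrays."""
    return float(np.mean((a - b) ** 2))

def multi_aspect_distill_loss(t_attn, s_attn, t_ffn, s_ffn,
                              alpha=1.0, beta=1.0):
    """Illustrative multi-aspect distillation loss (NOT the paper's exact loss).

    t_attn / s_attn: teacher / student attention maps, shape (heads, seq, seq).
    t_ffn  / s_ffn : teacher / student feed-forward outputs, shape (seq, hidden),
                     assumed already projected to a shared hidden size.
    alpha / beta   : hypothetical weights for the two alignment terms.
    """
    return alpha * mse(t_attn, s_attn) + beta * mse(t_ffn, s_ffn)

# Toy example with random "teacher" and "student" activations.
rng = np.random.default_rng(0)
t_attn = rng.random((4, 8, 8))
s_attn = rng.random((4, 8, 8))
t_ffn = rng.random((8, 64))
s_ffn = rng.random((8, 64))
loss = multi_aspect_distill_loss(t_attn, s_attn, t_ffn, s_ffn)
```

A perfectly aligned student (identical activations) would drive this loss to zero, which is the intuition behind module-level mimicry.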