SDMPrune: Self-Distillation MLP Pruning for Efficient Large Language Models

📅 2025-06-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Deploying large language models (LLMs) incurs high computational and memory costs, and existing gradient-based pruning methods, which rely on one-hot labels, fail to use the full vocabulary-level predictions of the LLM. Method: The paper proposes a self-distillation-guided structured pruning method targeting MLP modules. The self-distillation loss is integrated directly into the pruning process rather than applied only as a post-training step, so the soft probability distributions of the original model guide gradient computation and parameter importance estimation. Pruning is restricted to MLP modules, which hold more than 5× as many parameters as attention modules (LLaMA3.2-1.2B) and to which the model's predictions are less sensitive. Contribution/Results: The method significantly outperforms prior pruning approaches on multiple zero-shot benchmarks, and the pruned models are highly competitive among open-source 1B-scale LLMs, with model size substantially reduced and no obvious performance degradation.

📝 Abstract

Despite the strong performance of LLMs, their deployment costs remain prohibitive. Gradient-based pruning methods have shown promising effectiveness for compressing LLMs. However, these methods compute gradients with one-hot labels, which ignores the model's predictions for other words and thus misses information that is key to the generative capability of the original model. To address this issue, we introduce a self-distillation loss during the pruning phase (rather than post-training) to fully exploit the predictions of the original model, thereby obtaining more accurate gradient information for pruning. Moreover, we find that the predictions of LLMs are less sensitive to multilayer perceptron (MLP) modules than to attention modules, even though MLP modules hold more than $5\times$ as many parameters (LLaMA3.2-1.2B). We therefore focus on pruning MLP modules, which significantly compresses the LLM without obvious performance degradation. Experimental results on extensive zero-shot benchmarks demonstrate that our method significantly outperforms existing pruning methods. Furthermore, our method achieves very competitive performance among 1B-scale open-source LLMs. The source code and trained weights are available at https://github.com/visresearch/SDMPrune.
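
To make the one-hot versus soft-label contrast concrete, here is a minimal sketch of the gradient each target sends back through the output logits. It assumes the distillation loss is a cross-entropy (equivalently, KL divergence) against the original model's softmax output, whose gradient with respect to the logits is `softmax(logits) - target`. The function names, the toy logits, and the idea of a separate "pruned" forward pass are illustrative assumptions; the abstract does not specify the exact loss form, temperature, or how gradients are aggregated into importance scores.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_wrt_logits(logits, target_probs):
    """Gradient of cross-entropy(target, softmax(logits)) w.r.t. the logits.

    For any target distribution that sums to 1 this equals softmax(logits) - target.
    """
    probs = softmax(logits)
    return [p - t for p, t in zip(probs, target_probs)]

# Hypothetical logits over a 4-word vocabulary (made-up numbers).
teacher_logits = [2.0, 1.5, -1.0, -3.0]   # original model
pruned_logits = [1.0, 1.8, -0.5, -2.0]    # model under pruning (assumed)

one_hot = [1.0, 0.0, 0.0, 0.0]            # ground-truth next word is index 0
teacher_probs = softmax(teacher_logits)   # soft targets from the original model

g_hard = grad_wrt_logits(pruned_logits, one_hot)
g_soft = grad_wrt_logits(pruned_logits, teacher_probs)

for name, g in (("one-hot", g_hard), ("self-distill", g_soft)):
    print(name, [round(x, 4) for x in g])
```

With the one-hot target, every non-label word is pushed toward probability zero, including word 1, which the original model considers almost as likely as the label. With the soft target, the gradient on each word is proportional to how far the pruned model has drifted from the original model's prediction for that word, which is the extra information the abstract refers to.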

Problem

Research questions and friction points this paper is trying to address.

Reduce deployment costs of large language models

Improve gradient-based pruning with self-distillation loss

Focus on pruning MLP modules for efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-distillation loss enhances pruning accuracy

Focuses on pruning MLP modules for efficiency

Achieves competitive performance in 1B-scale LLMs
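
As an illustration of what structured MLP pruning involves, the sketch below removes whole hidden channels from a gated MLP of the kind used in LLaMA-style models (gate, up, and down projections). The scoring rule, a first-order Taylor-style sum of |weight × gradient| per channel, is a common criterion assumed here for illustration and is not confirmed as the paper's exact rule. All function names and the toy matrices are hypothetical, and the gradients are hand-written rather than produced by a real backward pass.

```python
def channel_importance(weights, grads):
    """Score each hidden channel as the sum of |w * g| over that channel's row."""
    return [
        sum(abs(w * g) for w, g in zip(w_row, g_row))
        for w_row, g_row in zip(weights, grads)
    ]

def prune_mlp(gate, up, down, scores, keep):
    """Keep the `keep` highest-scoring hidden channels.

    gate, up: [hidden][d_model] matrices (one row per hidden channel).
    down:     [d_model][hidden] matrix (one column per hidden channel).
    """
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:keep]
    kept = sorted(top)  # preserve the original channel order
    gate_p = [gate[i] for i in kept]
    up_p = [up[i] for i in kept]
    down_p = [[row[i] for i in kept] for row in down]
    return gate_p, up_p, down_p, kept

# Toy MLP: d_model = 2, hidden = 4 (made-up numbers).
gate = [[1, 0], [0, 1], [1, 1], [2, 0]]
up = [[0.5, 0.5], [1.0, -1.0], [0.1, 0.2], [2.0, 1.0]]
down = [[1, 2, 3, 4], [5, 6, 7, 8]]

# Hypothetical gradients of the pruning loss w.r.t. `up`.
up_grad = [[0.1, 0.1], [0.0, 0.0], [1.0, 1.0], [0.5, -0.5]]

scores = channel_importance(up, up_grad)
gate_p, up_p, down_p, kept = prune_mlp(gate, up, down, scores, keep=2)
print("kept channels:", kept)
print("pruned down projection:", down_p)
```

Because a hidden channel corresponds to one row of the gate and up projections and one column of the down projection, dropping it shrinks all three matrices consistently and leaves a smaller dense MLP, with no sparse kernels required at inference time.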

🔎 Similar Papers

BlockPruner: Fine-grained Pruning for Large Language Models