Sparsity Induction for Accurate Post-Training Pruning of Large Language Models

📅 2026-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work targets the computational and memory bottlenecks that large language models incur from their massive parameter counts. Existing post-training pruning methods often suffer significant performance degradation on these models because the original dense weights lack inherent sparsity. To improve pruning compatibility, the authors induce sparsity at both the weight-distribution and feature-representation levels before pruning: an absorbable equivalent scaling transformation promotes weight-distribution sparsity, and a spectral norm-based loss encourages low-rank feature sparsity. The approach adds no parameters and no inference overhead, and it consistently outperforms existing post-training pruning techniques across diverse model architectures and tasks, yielding accurate and efficient model compression.
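The "absorbable equivalent scaling" idea can be illustrated with a minimal sketch. The paper does not publish code in this summary, so the shapes, the two-layer setup, and the per-channel scale vector `s` below are illustrative assumptions; the key property shown is that a positive per-channel scaling applied to one layer and inverted in the next is mathematically equivalent (ReLU commutes with positive scales), so it reshapes the weight distribution at zero inference cost.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-layer block with a ReLU in between: y = W2 @ relu(W1 @ x)
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(3, 8))
x = rng.normal(size=(4,))

y = W2 @ np.maximum(W1 @ x, 0.0)

# Per-channel positive scales (hypothetical choice): since
# relu(s * z) = s * relu(z) for s > 0, scaling rows of W1 and
# inverse-scaling the matching columns of W2 leaves the output unchanged.
s = rng.uniform(0.1, 10.0, size=(8,))
W1_scaled = s[:, None] * W1   # scale rows of W1
W2_scaled = W2 / s[None, :]   # absorb the inverse scale into W2's columns

y_scaled = W2_scaled @ np.maximum(W1_scaled @ x, 0.0)

print(np.allclose(y, y_scaled))  # → True: the transform is fully absorbable
```

In practice the scales would be chosen to concentrate weight magnitude (e.g. to make more entries near zero) so that a subsequent magnitude-based pruner removes less useful signal; how the paper selects `s` is not specified here.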

📝 Abstract
Large language models have demonstrated strong capabilities in text generation, while their increasing parameter scales present challenges in computational and memory efficiency. Post-training sparsity (PTS), which reduces model cost by removing weights from dense networks, is an effective approach. However, natively dense matrices lack high sparsity, so existing approaches that directly remove weights disrupt model states, resulting in unsatisfactory performance recovery even with post-tuning. We propose Sparsity Induction, which promotes models toward higher sparsity at both the distribution and feature levels before pruning, to push the limits of PTS. At the distribution level, we enhance distributional sparsity through mathematically equivalent scaling transformations, which are fully absorbable and incur no extra parameters or inference-time overhead. At the feature level, we introduce Spectral Norm Loss to promote feature sparsity from a low-rank perspective. Experiments across diverse model architectures and tasks demonstrate that our method further enhances sparsity-friendliness, achieving superior pruning performance over existing approaches.
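The abstract's "Spectral Norm Loss" is described only as promoting feature sparsity from a low-rank perspective; the exact formulation is not given here. One plausible sketch, under that assumption, is a loss that measures how much feature energy falls outside the top singular direction: it is zero for a rank-1 feature matrix and grows as the spectrum flattens, so minimizing it pushes features toward low rank. The function name and normalization below are illustrative, not the paper's definition.

```python
import numpy as np

def spectral_norm_loss(F):
    """Hypothetical low-rank-promoting loss: 1 - sigma_max^2 / ||F||_F^2.

    sigma_max is the spectral norm (largest singular value) of the feature
    matrix F; the Frobenius norm squared equals the sum of squared singular
    values, so the loss is the fraction of feature energy outside the top
    singular direction. It is 0 for rank-1 features and approaches 1 as the
    singular spectrum flattens.
    """
    sigma = np.linalg.svd(F, compute_uv=False)  # singular values, descending
    return 1.0 - sigma[0] ** 2 / np.sum(sigma ** 2)

rng = np.random.default_rng(0)
full_rank = rng.normal(size=(16, 16))                       # flat spectrum
rank_one = np.outer(rng.normal(size=16), rng.normal(size=16))

print(spectral_norm_loss(full_rank))  # noticeably above 0
print(spectral_norm_loss(rank_one))   # ~0.0: all energy in one direction
```

In a training loop this term would be added to the task loss on intermediate activations; a differentiable spectral norm is typically obtained via power iteration rather than a full SVD.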
Problem

Research questions and friction points this paper is trying to address.

post-training pruning
sparsity
large language models
model compression
weight removal
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparsity Induction
Post-Training Pruning
Distributional Sparsity
Spectral Norm Loss
Low-Rank Feature Sparsity