🤖 AI Summary
LLM pruning must balance inference acceleration against performance preservation, and existing methods degrade significantly when full-model sparsity-aware fine-tuning is unavailable. This paper proposes a decoder-block-level, regional-gradient-driven pruning framework. Methodologically, it introduces regional gradients to refine pruning scores, enabling more precise identification of redundant parameters; it designs a sparse-output consistency objective that minimizes the deviation between dense and sparse decoder outputs; and it supports LoRA-based fine-tuning, remaining orthogonal and complementary to sparsity-aware fine-tuning while keeping adaptation lightweight. On LLaMA-7B, the method completes pruning in under 10 minutes on a single NVIDIA H100 GPU, reduces perplexity by up to 32% over Wanda, and generalizes well across downstream tasks.
📝 Abstract
Large Language Model (LLM) pruning seeks to remove unimportant weights to speed up inference with minimal performance impact. However, existing methods often suffer from performance loss without full-model sparsity-aware fine-tuning. This paper presents Wanda++, a novel pruning framework that outperforms state-of-the-art methods by utilizing decoder-block-level **regional** gradients. Specifically, Wanda++ is the first to improve the pruning score with regional gradients, and it proposes an efficient regional optimization method to minimize pruning-induced discrepancies between the dense and sparse decoder outputs. Notably, Wanda++ improves perplexity by up to 32% over Wanda on language modeling and generalizes effectively to downstream tasks. Further experiments indicate the proposed method is orthogonal to sparsity-aware fine-tuning: Wanda++ can be combined with LoRA fine-tuning to achieve a perplexity improvement similar to that of the Wanda method. The method is lightweight, pruning a 7B LLaMA model in under 10 minutes on a single NVIDIA H100 GPU.
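To make the idea of a gradient-augmented pruning score concrete, here is a minimal NumPy sketch. The baseline Wanda score for weight `W[i, j]` is `|W[i, j]| * ||X[:, j]||_2` (weight magnitude times the L2 norm of the corresponding input activations). Wanda++'s exact regional-gradient formulation is given in the paper; the `alpha * |G|` term below is one plausible way to fold in a per-weight gradient magnitude `G` computed at the decoder-block level, and the helper names (`regional_pruning_scores`, `prune_rows`) are illustrative, not from the paper.

```python
import numpy as np

def regional_pruning_scores(W, X, G, alpha=1.0):
    """Assumed score: |W| * (activation norm + alpha * |regional gradient|).

    W: (out_features, in_features) weight matrix of one linear layer.
    X: (n_tokens, in_features) calibration inputs to that layer.
    G: (out_features, in_features) gradient of a block-level loss w.r.t. W.
    """
    act_norm = np.linalg.norm(X, axis=0)            # (in_features,) per-channel L2 norm
    return np.abs(W) * (act_norm[None, :] + alpha * np.abs(G))

def prune_rows(W, scores, sparsity=0.5):
    """Unstructured per-output-row pruning: zero the lowest-scoring weights in each row."""
    k = int(W.shape[1] * sparsity)                  # weights to remove per row
    idx = np.argsort(scores, axis=1)[:, :k]         # indices of the k smallest scores
    mask = np.ones_like(W)
    np.put_along_axis(mask, idx, 0.0, axis=1)
    return W * mask

# Usage: prune one layer to 50% sparsity with hypothetical calibration data.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
X = rng.normal(size=(16, 8))
G = rng.normal(size=(4, 8))
W_sparse = prune_rows(W, regional_pruning_scores(W, X, G, alpha=0.5))
```

Setting `alpha=0` recovers the plain Wanda score, which is consistent with the paper's framing of regional gradients as a refinement on top of Wanda rather than a replacement.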