🤖 AI Summary
LLM pruning must balance inference acceleration against performance preservation, and existing methods degrade significantly when full-model sparsity-aware fine-tuning is unavailable. This paper proposes a decoder-block-level, regional-gradient-driven pruning framework. Methodologically, it introduces regional gradients to refine pruning scores, enabling more precise identification of redundant parameters; it designs a sparse-output consistency objective that minimizes the deviation between dense and sparse decoder outputs; and it supports LoRA-based fine-tuning, remaining orthogonal and complementary to sparsity-aware fine-tuning while keeping adaptation lightweight. On LLaMA-7B, the method completes pruning in under 10 minutes on a single NVIDIA H100 GPU, reduces perplexity by up to 32% over Wanda, and generalizes well across downstream tasks.
📝 Abstract
Large Language Model (LLM) pruning seeks to remove unimportant weights to speed up inference with minimal performance impact. However, existing methods often suffer from performance loss without full-model sparsity-aware fine-tuning. This paper presents Wanda++, a novel pruning framework that outperforms state-of-the-art methods by utilizing decoder-block-level **regional** gradients. Specifically, Wanda++ is the first to improve the pruning score with regional gradients, and it proposes an efficient regional optimization method to minimize pruning-induced discrepancies between the dense and sparse decoder outputs. Notably, Wanda++ improves perplexity by up to 32% over Wanda on language modeling and generalizes effectively to downstream tasks. Further experiments indicate the proposed method is orthogonal to sparsity-aware fine-tuning: Wanda++ can be combined with LoRA fine-tuning to achieve a perplexity improvement similar to that of the Wanda method. The method is lightweight, pruning a 7B LLaMA model in under 10 minutes on a single NVIDIA H100 GPU.
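To make the idea of a gradient-augmented pruning score concrete, here is a minimal NumPy sketch. The baseline Wanda score for weight `W[i, j]` is `|W[i, j]| * ||X[:, j]||_2` (weight magnitude times the L2 norm of the corresponding input activations). Wanda++'s exact regional-gradient formulation is given in the paper; the `alpha * |G|` term below is one plausible way to fold in a per-weight gradient magnitude `G` computed at the decoder-block level, and the helper names (`regional_pruning_scores`, `prune_rows`) are illustrative, not from the paper.

```python
import numpy as np

def regional_pruning_scores(W, X, G, alpha=1.0):
    """Assumed score: |W| * (activation norm + alpha * |regional gradient|).

    W: (out_features, in_features) weight matrix of one linear layer.
    X: (n_tokens, in_features) calibration inputs to that layer.
    G: (out_features, in_features) gradient of a block-level loss w.r.t. W.
    """
    act_norm = np.linalg.norm(X, axis=0)            # (in_features,) per-channel L2 norm
    return np.abs(W) * (act_norm[None, :] + alpha * np.abs(G))

def prune_rows(W, scores, sparsity=0.5):
    """Unstructured per-output-row pruning: zero the lowest-scoring weights in each row."""
    k = int(W.shape[1] * sparsity)                  # weights to remove per row
    idx = np.argsort(scores, axis=1)[:, :k]         # indices of the k smallest scores
    mask = np.ones_like(W)
    np.put_along_axis(mask, idx, 0.0, axis=1)
    return W * mask

# Usage: prune one layer to 50% sparsity with hypothetical calibration data.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
X = rng.normal(size=(16, 8))
G = rng.normal(size=(4, 8))
W_sparse = prune_rows(W, regional_pruning_scores(W, X, G, alpha=0.5))
```

Setting `alpha=0` recovers the plain Wanda score, which is consistent with the paper's framing of regional gradients as a refinement on top of Wanda rather than a replacement.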