Wanda++: Pruning Large Language Models via Regional Gradients

📅 2025-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Addressing the challenge of balancing inference acceleration and performance preservation in large language model (LLM) pruning, particularly when full-model sparsity-aware fine-tuning is unavailable and performance degrades significantly, this paper proposes a decoder-block-level, regional-gradient-driven pruning framework. Methodologically, it introduces regional gradients to refine the pruning score, enabling more precise identification of redundant parameters; designs a sparse-output-consistency objective that minimizes the deviation between dense and pruned decoder-block outputs; and remains orthogonal to sparsity-aware fine-tuning, so it can be combined with lightweight LoRA fine-tuning. On LLaMA-7B, the method completes pruning in under 10 minutes on a single H100 GPU, reduces perplexity by up to 32% relative to Wanda, and generalizes well across downstream tasks.
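To make the scoring idea concrete, here is a minimal sketch of a Wanda-style pruning score augmented with a regional-gradient term. It assumes the general recipe described above (per-weight magnitude times input-activation norm, plus a block-level gradient contribution weighted by a hypothetical coefficient `alpha`); the exact form used in Wanda++ may differ.

```python
import torch

def regional_pruning_score(weight, act_norm, regional_grad, alpha=1.0):
    """Wanda-style score augmented with a regional (decoder-block-level) gradient term.

    weight:        (out_features, in_features) weight matrix of one linear layer
    act_norm:      (in_features,) L2 norm of each input channel over calibration data
    regional_grad: (out_features, in_features) gradient of a block-level loss w.r.t. weight
    alpha:         hypothetical coefficient mixing in the gradient term
    """
    # Wanda term: |W_ij| * ||X_j||_2
    wanda_term = weight.abs() * act_norm.unsqueeze(0)
    # Regional-gradient term: weights with large block-level gradients are kept.
    grad_term = alpha * regional_grad.abs() * weight.abs()
    return wanda_term + grad_term

def unstructured_mask(score, sparsity=0.5):
    """Per-output-row mask keeping the highest-scoring weights (unstructured sparsity)."""
    k = int(score.shape[1] * sparsity)
    _, prune_idx = torch.topk(score, k, dim=1, largest=False)  # lowest-scoring weights
    mask = torch.ones_like(score, dtype=torch.bool)
    mask.scatter_(1, prune_idx, False)
    return mask
```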

📝 Abstract
Large Language Model (LLM) pruning seeks to remove unimportant weights for inference speedup with minimal performance impact. However, existing methods often suffer from performance loss without full-model sparsity-aware fine-tuning. This paper presents Wanda++, a novel pruning framework that outperforms the state-of-the-art methods by utilizing decoder-block-level regional gradients. Specifically, Wanda++ improves the pruning score with regional gradients for the first time and proposes an efficient regional optimization method to minimize pruning-induced output discrepancies between the dense and sparse decoder outputs. Notably, Wanda++ improves perplexity by up to 32% over Wanda on the language modeling task and generalizes effectively to downstream tasks. Further experiments indicate our proposed method is orthogonal to sparsity-aware fine-tuning, where Wanda++ can be combined with LoRA fine-tuning to achieve a perplexity improvement similar to that of the Wanda method. The proposed method is lightweight, pruning a 7B LLaMA model in under 10 minutes on a single NVIDIA H100 GPU.
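The regional optimization step described in the abstract (minimizing the output discrepancy between the dense and pruned decoder block) can be pictured as a small block-local reconstruction loop. The sketch below is illustrative rather than the paper's exact procedure; the interface of `decoder_block`, the MSE loss, the Adam optimizer, and the step count are all assumptions.

```python
import torch
import torch.nn.functional as F

def regional_optimization(decoder_block, masks, calib_inputs, steps=100, lr=1e-5):
    """Block-local fine-tuning that pulls the pruned block's outputs back toward the dense ones.

    decoder_block: one transformer decoder block, treated here as a plain callable
    masks:         dict mapping each pruned linear layer in the block to its boolean mask
    calib_inputs:  a small list of calibration hidden states, shape (batch, seq, hidden)
    """
    # Record the dense block's outputs on calibration data as targets.
    with torch.no_grad():
        dense_out = [decoder_block(x) for x in calib_inputs]

    # Apply the pruning masks (zero out pruned weights).
    with torch.no_grad():
        for layer, mask in masks.items():
            layer.weight.mul_(mask.to(layer.weight.dtype))

    opt = torch.optim.Adam(decoder_block.parameters(), lr=lr)
    for _ in range(steps):
        for x, target in zip(calib_inputs, dense_out):
            sparse_out = decoder_block(x)
            # Sparse-output-consistency objective: match the dense block's output.
            loss = F.mse_loss(sparse_out, target)
            opt.zero_grad()
            loss.backward()
            opt.step()
            # Re-impose sparsity so pruned weights stay exactly zero after the update.
            with torch.no_grad():
                for layer, mask in masks.items():
                    layer.weight.mul_(mask.to(layer.weight.dtype))
    return decoder_block
```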
Problem

Research questions and friction points this paper is trying to address.

Improving pruning efficiency for Large Language Models
Reducing the performance loss that pruning causes when full-model sparsity-aware fine-tuning is unavailable
Preserving language modeling quality and downstream task performance after pruning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilizes decoder-block-level regional gradients to refine the pruning score
Proposes an efficient regional optimization method that aligns pruned and dense decoder-block outputs
Lightweight: prunes a 7B LLaMA model in under 10 minutes on a single H100 GPU
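Since the paper reports that Wanda++ is orthogonal to sparsity-aware fine-tuning and can be combined with LoRA, a minimal sketch of attaching a LoRA adapter to a pruned linear layer (keeping the pruning mask fixed and the sparse base weights frozen) is shown below; the class name, rank, and scaling hyperparameters are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRAPrunedLinear(nn.Module):
    """Illustrative LoRA adapter on top of a frozen, pruned linear layer."""

    def __init__(self, pruned_linear: nn.Linear, mask: torch.Tensor, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = pruned_linear
        self.base.weight.requires_grad_(False)   # frozen sparse base weights
        self.register_buffer("mask", mask)       # pruning mask stays fixed
        out_f, in_f = pruned_linear.weight.shape
        # Low-rank adapters: only A and B receive gradients during fine-tuning.
        self.lora_A = nn.Parameter(torch.randn(r, in_f) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_f, r))
        self.scaling = alpha / r

    def forward(self, x):
        # Sparse base projection plus dense low-rank LoRA update.
        sparse_weight = self.base.weight * self.mask.to(self.base.weight.dtype)
        base_out = F.linear(x, sparse_weight, self.base.bias)
        lora_out = (x @ self.lora_A.t()) @ self.lora_B.t() * self.scaling
        return base_out + lora_out
```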
👥 Authors
Yifan Yang (University of California, Santa Barbara)
Kai Zhen (Amazon, IUB)
Bhavana Ganesh (Amazon AGI)
A. Galstyan (Amazon AGI)
Goeric Huybrechts (Amazon)
Markus Muller (Amazon AGI)
Jonas M. Kubler (Amazon AGI)
R. Swaminathan (Amazon AGI)
Athanasios Mouchtaris (Amazon)
S. Bodapati (Amazon AGI)
Nathan Susanj (Amazon)
Zheng Zhang (University of California, Santa Barbara)
Jack FitzGerald (Amazon AGI)
Abhishek Kumar (Amazon AGI)