From Local to Global: Revisiting Structured Pruning Paradigms for Large Language Models

📅 2025-10-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current structured pruning predominantly adopts task-agnostic, layer-wise reconstruction paradigms, which fail to leverage downstream task signals and thus yield limited accuracy gains—especially on decision-making tasks—after compression. To address this, we propose a global iterative structured pruning framework driven by model-level task loss. It defines module-level importance scores for attention heads and MLP channels, and employs first-order gradient estimation, block-wise normalization, and nested subnetwork iteration to achieve high sparsity without fine-tuning. This approach overcomes the limitations of local reconstruction by enabling task-aligned pruning. Evaluated on Llama2/3 and Mistral models, it achieves 40–50% sparsity while significantly reducing WikiText-2 perplexity and substantially improving accuracy on downstream benchmarks such as GSM8K.
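The scoring step described above can be sketched in a few lines. Below is a minimal, illustrative PyTorch snippet, assuming a Hugging Face Llama-style model: it backpropagates a model-level loss once, computes a first-order |w * dL/dw| importance for each attention head (via the output-projection columns) and each MLP intermediate channel (via the down-projection columns), and normalizes scores within each transformer block so they can be ranked globally. Names such as `head_scores` and `score_model` are illustrative, not the paper's released API.

```python
import torch

def head_scores(o_proj_weight: torch.Tensor, num_heads: int) -> torch.Tensor:
    """First-order importance per attention head: sum of |w * dL/dw| over the
    output-projection columns that belong to each head."""
    elem = (o_proj_weight * o_proj_weight.grad).abs()        # (hidden, num_heads * head_dim)
    return elem.view(elem.shape[0], num_heads, -1).sum(dim=(0, 2))

def channel_scores(down_proj_weight: torch.Tensor) -> torch.Tensor:
    """First-order importance per MLP intermediate channel, via the down projection."""
    return (down_proj_weight * down_proj_weight.grad).abs().sum(dim=0)

def score_model(model, batch):
    """Backpropagate a model-level loss once, then score heads and MLP channels
    with block-wise normalization so scores are comparable across blocks.
    Assumes `batch` holds input_ids (and attention_mask) but no labels key."""
    model.zero_grad()
    model(**batch, labels=batch["input_ids"]).loss.backward()   # model-level task loss

    scores = {}
    num_heads = model.config.num_attention_heads
    for i, layer in enumerate(model.model.layers):
        heads = head_scores(layer.self_attn.o_proj.weight, num_heads)
        chans = channel_scores(layer.mlp.down_proj.weight)
        norm = torch.cat([heads, chans]).norm() + 1e-8           # block-wise normalization
        scores[i] = {"heads": heads / norm, "mlp_channels": chans / norm}
    return scores
```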

📝 Abstract
Structured pruning is a practical approach to deploying large language models (LLMs) efficiently, as it yields compact, hardware-friendly architectures. However, the dominant local paradigm is task-agnostic: by optimizing layer-wise reconstruction rather than task objectives, it tends to preserve perplexity or generic zero-shot behavior but fails to capitalize on modest task-specific calibration signals, often yielding limited downstream gains. We revisit global structured pruning and present GISP (Global Iterative Structured Pruning), a post-training method that removes attention heads and MLP channels using first-order, loss-based importance weights aggregated at the structure level with block-wise normalization. An iterative schedule, rather than one-shot pruning, stabilizes accuracy at higher sparsity and mitigates perplexity collapse without requiring intermediate fine-tuning; the pruning trajectory also forms nested subnetworks that support a "prune-once, deploy-many" workflow. Furthermore, because importance is defined by a model-level loss, GISP naturally supports task-specific objectives; we instantiate perplexity for language modeling and a margin-based objective for decision-style tasks. Extensive experiments show that across Llama2-7B/13B, Llama3-8B, and Mistral-0.3-7B, GISP consistently lowers WikiText-2 perplexity and improves downstream accuracy, with especially strong gains at 40-50% sparsity; on DeepSeek-R1-Distill-Llama-3-8B with GSM8K, task-aligned calibration substantially boosts exact-match accuracy.
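To make the iterative schedule and the nested-subnetwork property concrete, here is a minimal, framework-agnostic sketch (NumPy only) of the outer loop under stated assumptions: each round re-scores the still-active structures on the current subnetwork and removes the globally lowest-scoring ones, so every sparser network is contained in the previous one and each intermediate mask can be stored for a "prune-once, deploy-many" deployment. The `score_fn` callable stands in for the importance scoring; the sparsity schedule and function names are assumptions, not the paper's exact procedure.

```python
import numpy as np

def iterative_global_prune(score_fn, n_structures, steps=(0.1, 0.2, 0.3, 0.4, 0.5)):
    """Iteratively prune to each target sparsity in `steps`.
    score_fn(active_mask) -> array with one importance score per structure,
    re-evaluated on the current subnetwork at every round."""
    active = np.ones(n_structures, dtype=bool)
    snapshots = {}                                            # sparsity -> mask, for deploy-many
    for target in steps:
        scores = np.where(active, score_fn(active), -np.inf)  # pruned structures never revive
        n_keep = n_structures - int(round(target * n_structures))
        keep = np.argsort(scores)[::-1][:n_keep]              # globally highest-importance survivors
        new_active = np.zeros(n_structures, dtype=bool)
        new_active[keep] = True
        active &= new_active                                  # monotone: nested subnetworks
        snapshots[target] = active.copy()
    return snapshots

# Toy usage: 64 structures with random importance; sparser masks are nested in denser ones.
rng = np.random.default_rng(0)
masks = iterative_global_prune(lambda m: rng.random(m.size), 64)
assert not (masks[0.5] & ~masks[0.4]).any()
```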
Problem

Research questions and friction points this paper is trying to address.

Local pruning preserves perplexity but fails to utilize task-specific signals effectively
Global pruning needs to maintain accuracy at high sparsity without fine-tuning
Existing methods lack task-aligned optimization for improved downstream performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Global structured pruning removes attention heads and MLP channels ranked by a model-level, first-order importance score
Iterative pruning with nested subnetworks stabilizes accuracy at high sparsity without intermediate fine-tuning
Task-specific objectives (perplexity for language modeling, a margin-based loss for decision-style tasks) boost performance on specialized tasks; see the sketch below
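As a concrete illustration of the last point, the sketch below contrasts the two objectives named in the abstract: standard next-token cross-entropy (the perplexity objective for language modeling) and a margin-based objective for decision-style tasks that rewards the correct candidate answer for out-scoring the best distractor by a margin. This is a plausible instantiation under stated assumptions, not the paper's exact formulation; the function names and default margin are illustrative.

```python
import torch
import torch.nn.functional as F

def perplexity_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Standard next-token cross-entropy, the language-modeling (perplexity) objective."""
    shift_logits = logits[:, :-1].reshape(-1, logits.size(-1))
    shift_labels = input_ids[:, 1:].reshape(-1)
    return F.cross_entropy(shift_logits, shift_labels)

def margin_loss(choice_logprobs: torch.Tensor, labels: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    """Margin-based objective for decision-style tasks.
    choice_logprobs: (batch, n_choices) sequence log-likelihood of each candidate answer.
    Penalizes cases where the correct choice fails to beat the best distractor by `margin`."""
    correct = choice_logprobs.gather(1, labels.unsqueeze(1)).squeeze(1)
    masked = choice_logprobs.scatter(1, labels.unsqueeze(1), float("-inf"))
    best_wrong = masked.max(dim=1).values
    return F.relu(margin - (correct - best_wrong)).mean()
```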
Authors
Ziyan Wang (University of North Carolina at Charlotte)
Enmao Diao, Ph.D. (Distributed Machine Learning, Efficient Machine Learning, Signal Processing)
Qi Le (University of Minnesota)
Pu Wang (University of North Carolina at Charlotte)
Minwoo Lee (University of North Carolina at Charlotte)
Shu-ping Yeh (Intel Corporation; RAN-AI, Multi-RAT Networks, Full-Duplex System, Femtocells, Multi-tier Networks)
Evgeny Stupachenko (Intel Corporation)
Hao Feng (Intel Corporation)
Li Yang (University of North Carolina at Charlotte)