🤖 AI Summary
Current structured pruning predominantly adopts task-agnostic, layer-wise reconstruction paradigms, which fail to leverage downstream task signals and thus yield limited post-compression accuracy gains, especially on decision-making tasks. To address this, we propose a global iterative structured pruning framework driven by model-level task loss. It defines module-level importance scores for attention heads and MLP channels, and employs first-order gradient estimation, block-wise normalization, and nested subnetwork iteration to achieve high sparsity without fine-tuning. By enabling task-aligned pruning, this approach overcomes the limitations of local reconstruction. Evaluated on Llama2/3 and Mistral models, it achieves 40-50% sparsity while significantly reducing WikiText-2 perplexity and substantially improving accuracy on downstream benchmarks such as GSM8K.
📝 Abstract
Structured pruning is a practical approach to deploying large language models (LLMs) efficiently, as it yields compact, hardware-friendly architectures. However, the dominant local paradigm is task-agnostic: by optimizing layer-wise reconstruction rather than task objectives, it tends to preserve perplexity or generic zero-shot behavior but fails to capitalize on modest task-specific calibration signals, often yielding limited downstream gains. We revisit global structured pruning and present GISP (Global Iterative Structured Pruning), a post-training method that removes attention heads and MLP channels using first-order, loss-based importance weights aggregated at the structure level with block-wise normalization. An iterative schedule, rather than one-shot pruning, stabilizes accuracy at higher sparsity and mitigates perplexity collapse without requiring intermediate fine-tuning; the pruning trajectory also forms nested subnetworks that support a "prune-once, deploy-many" workflow. Furthermore, because importance is defined by a model-level loss, GISP naturally supports task-specific objectives; we instantiate perplexity for language modeling and a margin-based objective for decision-style tasks. Extensive experiments show that across Llama2-7B/13B, Llama3-8B, and Mistral-0.3-7B, GISP consistently lowers WikiText-2 perplexity and improves downstream accuracy, with especially strong gains at 40-50% sparsity; on DeepSeek-R1-Distill-Llama-3-8B with GSM8K, task-aligned calibration substantially boosts exact-match accuracy.
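To make the described pipeline concrete, the sketch below illustrates the three ingredients the abstract names: a first-order (Taylor-style) importance score per structure, block-wise normalization so scores are comparable globally, and an iterative schedule whose kept sets are nested across steps. This is a minimal NumPy toy, not the authors' implementation; all function names are hypothetical, scores are held fixed across steps for brevity (in practice they would be recomputed from fresh gradients after each pruning step), and "structures" stand in for attention heads or MLP channels.

```python
import numpy as np

def taylor_importance(w, g):
    """First-order importance of one structure: |sum_i w_i * g_i| over its parameters."""
    return abs(float(np.sum(w * g)))

def blockwise_normalize(scores):
    """L2-normalize scores within one block so blocks compete fairly in a global ranking."""
    s = np.asarray(scores, dtype=float)
    return s / (np.linalg.norm(s) + 1e-12)

def iterative_global_prune(block_scores, target_sparsity, steps):
    """Globally remove the lowest-scoring structures, a growing fraction per step.
    block_scores: list of per-block score arrays. Returns one keep-mask per block.
    Because the pruned set only grows, each step's subnetwork is nested in the previous one.
    """
    masks = [np.ones(len(s), dtype=bool) for s in block_scores]
    total = sum(len(s) for s in block_scores)
    for step in range(1, steps + 1):
        n_prune = int(total * target_sparsity * step / steps)
        # rank all still-alive structures by block-normalized importance
        flat = []
        for b, (s, m) in enumerate(zip(block_scores, masks)):
            norm = blockwise_normalize(s)
            flat.extend((norm[i], b, i) for i in range(len(s)) if m[i])
        flat.sort()
        already_pruned = total - sum(int(m.sum()) for m in masks)
        for _, b, i in flat[: n_prune - already_pruned]:
            masks[b][i] = False
    return masks
```

Under these assumptions, pruning 4 blocks of 8 structures each to 50% sparsity over 5 steps leaves 16 structures, and the surviving set at every intermediate step contains the final one, which is what enables the "prune-once, deploy-many" workflow of exporting several sparsity levels from a single run.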