Týr-the-Pruner: Unlocking Accurate 50% Structural Pruning for LLMs via Global Sparsity Distribution Optimization

📅 2025-03-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Global sparsity distribution in structured pruning of large language models (LLMs) is often suboptimal, leading to significant accuracy degradation. Method: The paper proposes an efficient end-to-end search-based global structural pruning framework. Departing from local pruning, which ignores inter-layer dependencies, and from conventional global pruning, which ranks structural saliency uniformly and incurs high computational overhead, it constructs a supernet by applying local pruning at multiple sparsity ratios to each layer and searches for the per-layer sparsity distribution that is optimal under a target overall sparsity. An expectation error accumulation technique improves supernet construction, and an iterative prune-and-search strategy with coarse-to-fine sparsity granularity ensures efficient search convergence. Results: On Llama-3.1-70B, the method achieves 50% structured pruning while retaining 97% of the dense model's performance, setting a new state of the art for structured LLM pruning.

📝 Abstract
Structural pruning enhances hardware-agnostic inference efficiency for large language models (LLMs) but often struggles to maintain performance. Local pruning performs efficient layer-by-layer compression but ignores global topology. Global pruning can find the optimal solution but is resource-intensive; moreover, existing methods tend to rank structural saliency uniformly, ignoring inter-structure dependencies and failing to achieve end-to-end optimization. To address these limitations, we propose Týr-the-Pruner, an efficient end-to-end search-based global structural pruning framework. This framework constructs a supernet by repeatedly applying local pruning across a range of sparsity ratios to each layer in an LLM, with the core goal of determining the optimal sparsity distribution under a target overall sparsity ratio. Concretely, we introduce an effective local pruning method and an expectation error accumulation approach to improve supernet construction. Furthermore, we employ an iterative prune-and-search strategy with coarse-to-fine sparsity granularity to ensure efficient search convergence. Experimental results show that Týr-the-Pruner achieves state-of-the-art structural pruning, retaining 97% of the dense model's performance while removing a challenging 50% of Llama-3.1-70B's parameters.
Problem

Research questions and friction points this paper is trying to address.

Optimizing global sparsity distribution for LLM structural pruning
Addressing performance loss in high-ratio (50%) LLM pruning
Improving end-to-end optimization of inter-structure dependencies in pruning
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end search-based global structural pruning framework
Local pruning with expectation error accumulation
Iterative prune-and-search strategy for efficient convergence
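The iterative coarse-to-fine idea in the last bullet can be sketched with a minimal greedy search, assuming a pairwise-transfer scheme (shift sparsity from one layer to another so the overall budget is preserved exactly) and an assumed per-layer error estimate `layer_error(i, s)`; this is an illustration of the coarse-to-fine principle, not the paper's algorithm.

```python
# Hypothetical sketch (illustrative, not the paper's implementation): start
# from a uniform per-layer sparsity distribution at the target ratio, greedily
# transfer sparsity between layer pairs whenever that lowers the summed error
# estimate, and halve the step size each round (coarse-to-fine granularity).

def coarse_to_fine(n_layers, layer_error, target, rounds=3, init_step=0.25):
    """Search a per-layer sparsity distribution whose mean stays at `target`.

    `layer_error(i, s)` is an assumed error estimate for layer i at sparsity
    s; pairwise transfers keep the overall sparsity budget exactly fixed.
    """
    dist = [target] * n_layers
    step = init_step
    total = lambda d: sum(layer_error(i, s) for i, s in enumerate(d))
    for _ in range(rounds):
        improved = True
        while improved:  # keep transferring until no pair improves
            improved = False
            for i in range(n_layers):
                for j in range(n_layers):
                    if i == j or dist[i] + step > 1.0 or dist[j] - step < 0.0:
                        continue
                    cand = dist[:]
                    cand[i] += step  # prune layer i more aggressively...
                    cand[j] -= step  # ...and layer j less; budget preserved
                    if total(cand) < total(dist) - 1e-12:
                        dist, improved = cand, True
        step /= 2  # refine the sparsity granularity for the next round
    return dist
```

With a toy error model that grows with layer depth, the search shifts sparsity toward earlier layers while the mean stays pinned at the 50% target, and each halving of the step lets it settle closer to the continuous optimum.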
Guanchen Li (Advanced Micro Devices, Inc.)
Yixing Xu (AMD)
Zeping Li (PhD student at Fudan University, Financial Technology Group; LLM and KG)
Ji Liu (Advanced Micro Devices, Inc.)
Xuanwu Yin (Advanced Micro Devices, Inc.)
Dong Li (Advanced Micro Devices, Inc.)
E. Barsoum (Advanced Micro Devices, Inc.)