DLP: Dynamic Layerwise Pruning in Large Language Models

📅 2025-05-27
🤖 AI Summary
To address the sharp performance degradation of large language models (LLMs) under high sparsity caused by fixed layerwise pruning ratios, this paper proposes a dynamic inter-layer pruning method. The approach introduces a weight-activation co-aware importance scoring mechanism that removes the need for predefined per-layer sparsity constraints: it adaptively allocates layerwise pruning ratios by jointly modeling each layer's weight distribution and input activation statistics. The framework is lightweight, fully compatible with parameter-efficient fine-tuning (PEFT), and integrates seamlessly with mainstream compression techniques. At 70% overall sparsity, the method reduces the perplexity of LLaMA2-7B by 7.79 and improves average accuracy by 2.7%, significantly outperforming existing state-of-the-art pruning approaches.
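The weight-activation co-aware scoring described above can be pictured with a minimal sketch. This is an illustration only, assuming a Wanda-style |W| · ||x|| proxy for the score; the function name `layer_importance`, the tensor shapes, and the sum-reduction are assumptions, not the paper's exact formulation.

```python
import numpy as np

def layer_importance(weight, activations):
    """Score one layer by combining weight magnitudes with input
    activation statistics (a Wanda-style |W| * ||x|| proxy).

    weight:      (out_features, in_features) weight matrix of the layer
    activations: (tokens, in_features) inputs observed by the layer
    """
    # Per-input-channel L2 norm of the calibration activations.
    act_norm = np.linalg.norm(activations, axis=0)   # shape: (in_features,)
    # Element-wise co-aware score, reduced to one scalar per layer.
    return float(np.sum(np.abs(weight) * act_norm))

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))    # toy layer weights
X = rng.normal(size=(32, 16))   # toy calibration activations
score = layer_importance(W, X)
```

Scores computed this way for each layer can then be compared to decide which layers tolerate more pruning.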

๐Ÿ“ Abstract
Pruning has recently been widely adopted to reduce the parameter scale and improve the inference efficiency of Large Language Models (LLMs). Mainstream pruning techniques often rely on uniform layerwise pruning strategies, which can lead to severe performance degradation at high sparsity levels. Recognizing the varying contributions of different layers in LLMs, recent studies have shifted their focus toward non-uniform layerwise pruning. However, these approaches often rely on pre-defined values, which can result in suboptimal performance. To overcome these limitations, we propose a novel method called Dynamic Layerwise Pruning (DLP). This approach adaptively determines the relative importance of each layer by integrating model weights with input activation information, assigning pruning rates accordingly. Experimental results show that DLP effectively preserves model performance at high sparsity levels across multiple LLMs. Specifically, at 70% sparsity, DLP reduces the perplexity of LLaMA2-7B by 7.79 and improves the average accuracy by 2.7% compared to state-of-the-art methods. Moreover, DLP is compatible with various existing LLM compression techniques and can be seamlessly integrated into Parameter-Efficient Fine-Tuning (PEFT). We release the code at https://github.com/ironartisan/DLP to facilitate future research.
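One way to picture the abstract's "assigning pruning rates accordingly" step is a small allocation sketch: more important layers receive lower pruning ratios while the average stays at the global target. The linear mapping, the `spread` parameter, and the function name `allocate_sparsity` are illustrative assumptions, not DLP's actual allocation rule.

```python
import numpy as np

def allocate_sparsity(importance, target_sparsity, spread=0.1):
    """Map per-layer importance scores to per-layer pruning ratios.

    Layers with above-average importance are pruned less, layers with
    below-average importance are pruned more; the adjustments cancel
    out, so the mean ratio equals the global target.
    """
    imp = np.asarray(importance, dtype=float)
    # Standardize importance to zero mean so offsets sum to zero.
    centered = (imp - imp.mean()) / (imp.std() + 1e-8)
    ratios = target_sparsity - spread * centered
    return np.clip(ratios, 0.0, 1.0)

# Four toy layers; the first is most important, the last is least.
ratios = allocate_sparsity([3.0, 1.0, 2.0, 0.5], target_sparsity=0.7)
```

With these toy scores the most important layer gets the lowest pruning ratio and the average ratio stays at 0.7 (clipping can perturb the mean in extreme cases, which a real implementation would need to correct for).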
Problem

Research questions and friction points this paper is trying to address.

Dynamic pruning for efficient Large Language Models
Adaptive layer importance for optimal sparsity performance
Compatibility with existing compression and fine-tuning techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic layerwise pruning adapts to layer importance
Integrates model weights with input activation data
Compatible with existing LLM compression techniques
Yuli Chen
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China
Bo Cheng
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China
Jiale Han
The Hong Kong University of Science and Technology
Yingying Zhang
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China
Yingting Li
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China
Shuhao Zhang
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China