A Sliding Layer Merging Method for Efficient Depth-Wise Pruning in LLMs

📅 2025-02-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Conventional layer-wise pruning of large language models (LLMs) causes severe performance degradation because it removes entire fixed layers at a coarse granularity. Method: We propose sliding layer merging, a fine-grained structural simplification technique that dynamically merges adjacent layers based on output similarity measured in a reproducing kernel Hilbert space (RKHS), using an adaptive similarity threshold in place of rigid layer removal. Contribution/Results: We empirically uncover a "patch-like" feature relationship among LLM layers. Under 35% depth pruning of Vicuna-7B, the method improves average zero-shot accuracy by 1.654% over the existing method, and post-compression fine-tuning recovers performance better than state-of-the-art pruning approaches. The method generalizes across architectures and model scales, and combining depth-wise with width-wise pruning further enhances the pruning effect.

📝 Abstract
Compared to width-wise pruning, depth-wise pruning can significantly accelerate inference in resource-constrained scenarios. However, treating the entire Transformer layer as the minimum pruning unit may degrade model performance by indiscriminately discarding all of the layer's information. This paper reveals the "patch-like" feature relationship between layers in large language models by analyzing the correlation of the outputs of different layers in the reproducing kernel Hilbert space. Building on this observation, we propose a sliding layer merging method that dynamically selects and fuses consecutive layers from top to bottom according to a pre-defined similarity threshold, thereby simplifying the model structure while maintaining its performance. Extensive experiments on LLMs with various architectures and different parameter scales show that our method outperforms existing pruning techniques in both zero-shot inference performance and retraining recovery quality after pruning. In particular, with 35% pruning on the Vicuna-7B model, our method achieves a 1.654% improvement in average performance on zero-shot tasks compared to the existing method. Moreover, we further reveal the potential of combining depth pruning with width pruning to enhance the pruning effect. Our codes are available at https://github.com/920927/SLM-a-sliding-layer-merging-method.
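The merging procedure described in the abstract can be sketched in a few lines. The snippet below is an illustrative reconstruction, not the authors' implementation: it uses linear CKA as a stand-in for the paper's RKHS output-correlation measure, and a greedy top-to-bottom sweep that groups consecutive layers whose outputs remain similar to the current anchor layer above a threshold. The function names, the choice of linear CKA, and the threshold value are all assumptions.

```python
import numpy as np

def linear_cka(X, Y):
    # Linear CKA similarity between two [n_tokens, d] layer-output matrices.
    # A common kernel-space similarity; stand-in for the paper's RKHS measure.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(X.T @ Y, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

def sliding_merge_groups(layer_outputs, threshold=0.9):
    # Sweep from the top layer downward; extend the current group while the
    # next-lower layer's output stays CKA-similar to the anchor layer above
    # `threshold`. Each returned index group would be fused into one layer.
    n = len(layer_outputs)
    groups, i = [], n - 1
    while i >= 0:
        j = i
        while j - 1 >= 0 and linear_cka(layer_outputs[i], layer_outputs[j - 1]) >= threshold:
            j -= 1
        groups.append(list(range(j, i + 1)))
        i = j - 1
    return groups[::-1]
```

On calibration data, `layer_outputs[k]` would hold the hidden states after layer `k`; layers grouped together are then replaced by a single merged layer, shrinking model depth while similar consecutive layers absorb each other's contribution.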
Problem

Research questions and friction points this paper is trying to address.

Efficient depth-wise pruning of LLMs
Dynamic layer merging that preserves model performance
Combining depth-wise and width-wise pruning for stronger compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sliding layer merging technique
Depth-wise pruning optimization
Dynamic fusion of consecutive Transformer layers
Xuan Ding
Beijing Normal University
Yao Zhu
Zhejiang University
Yunjian Zhang
University of Chinese Academy of Sciences
Chuanlong Xie
Beijing Normal University