A Sliding Layer Merging Method for Efficient Depth-Wise Pruning in LLMs

📅 2025-02-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Conventional layer-wise pruning of large language models (LLMs) causes severe performance degradation because it removes entire fixed layers at a coarse granularity. Method: We propose sliding layer merging, a fine-grained structural simplification technique that dynamically merges adjacent layers based on output similarity measured in a reproducing kernel Hilbert space (RKHS), using an adaptive similarity threshold in place of rigid layer removal. Contribution/Results: We empirically uncover a "patch-like" feature relationship among LLM layers. Under 35% depth pruning of Vicuna-7B, the method improves average zero-shot accuracy by 1.654% over the existing method, and post-compression fine-tuning recovers performance better than state-of-the-art pruning approaches. The method generalizes across architectures and model scales, and combining depth-wise with width-wise pruning further enhances the pruning effect.

📝 Abstract
Compared to width-wise pruning, depth-wise pruning can significantly accelerate inference in resource-constrained scenarios. However, treating the entire Transformer layer as the minimum pruning unit may degrade model performance by indiscriminately discarding all of the layer's information. This paper reveals the "patch-like" feature relationship between layers in large language models by analyzing the correlation of the outputs of different layers in the reproducing kernel Hilbert space. Building on this observation, we propose a sliding layer merging method that dynamically selects and fuses consecutive layers from top to bottom according to a pre-defined similarity threshold, thereby simplifying the model structure while maintaining its performance. Extensive experiments on LLMs with various architectures and different parameter scales show that our method outperforms existing pruning techniques in both zero-shot inference performance and retraining recovery quality after pruning. In particular, with 35% pruning on the Vicuna-7B model, our method achieves a 1.654% improvement in average performance on zero-shot tasks compared to the existing method. Moreover, we further reveal the potential of combining depth pruning with width pruning to enhance the pruning effect. Our codes are available at https://github.com/920927/SLM-a-sliding-layer-merging-method.
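The merging procedure described in the abstract can be sketched in a few lines. The snippet below is an illustrative reconstruction, not the authors' implementation: it uses linear CKA as a stand-in for the paper's RKHS output-correlation measure, and a greedy top-to-bottom sweep that groups consecutive layers whose outputs remain similar to the current anchor layer above a threshold. The function names, the choice of linear CKA, and the threshold value are all assumptions.

```python
import numpy as np

def linear_cka(X, Y):
    # Linear CKA similarity between two [n_tokens, d] layer-output matrices.
    # A common kernel-space similarity; stand-in for the paper's RKHS measure.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(X.T @ Y, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

def sliding_merge_groups(layer_outputs, threshold=0.9):
    # Sweep from the top layer downward; extend the current group while the
    # next-lower layer's output stays CKA-similar to the anchor layer above
    # `threshold`. Each returned index group would be fused into one layer.
    n = len(layer_outputs)
    groups, i = [], n - 1
    while i >= 0:
        j = i
        while j - 1 >= 0 and linear_cka(layer_outputs[i], layer_outputs[j - 1]) >= threshold:
            j -= 1
        groups.append(list(range(j, i + 1)))
        i = j - 1
    return groups[::-1]
```

On calibration data, `layer_outputs[k]` would hold the hidden states after layer `k`; layers grouped together are then replaced by a single merged layer, shrinking model depth while similar consecutive layers absorb each other's contribution.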
Problem

Research questions and friction points this paper is trying to address.

Efficient depth-wise pruning of LLMs
Dynamic layer merging that preserves model performance
Combining depth-wise and width-wise pruning for stronger compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sliding layer merging technique
Depth-wise pruning optimization
Dynamic fusion of consecutive Transformer layers
Xuan Ding
Beijing Normal University
Yao Zhu
Zhejiang University
Yunjian Zhang
University of Chinese Academy of Sciences
Chuanlong Xie
Beijing Normal University