The Structural Scalpel: Automated Contiguous Layer Pruning for Large Language Models

📅 2025-10-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Layer pruning of large language models (LLMs) for edge deployment often degrades performance sharply because inter-layer dependencies are ignored. Method: the paper proposes CLP, an automated continuous layer pruning framework with three components: (i) a differentiable concave gating mechanism that identifies the optimal contiguous layer segment for pruning via gradient-based optimization; (ii) cutoff endpoint tuning, which fine-tunes only the layers adjacent to the pruned segment to restore information flow; and (iii) seamless integration with quantization for joint compression. Results: on LLaMA3-70B at a 20% pruning rate, the method retains 95.34% of original accuracy, outperforming state-of-the-art baselines by 4.29%-30.52%. It is the first to unify continuous architecture search, differentiable pruning, and localized fine-tuning in a single end-to-end optimization paradigm, achieving a superior trade-off among accuracy, inference efficiency, and hardware compatibility.

📝 Abstract
Although large language models (LLMs) have achieved revolutionary breakthroughs in many fields, their large model size and high computational cost pose significant challenges for practical deployment on resource-constrained edge devices. To this end, layer pruning has been proposed to reduce the computational overhead by directly removing redundant layers. However, existing layer pruning methods typically rely on hand-crafted metrics to evaluate and remove individual layers, while ignoring the dependencies between layers. This can disrupt the model's information flow and severely degrade performance. To address these issues, we propose CLP, a novel continuous layer pruning framework that introduces two key innovations: a differentiable concave gate algorithm that automatically identifies the best continuous layer segments for pruning via gradient-based optimization; and a cutoff endpoint tuning strategy that effectively restores model performance by fine-tuning only the layers adjacent to the pruned segments. Extensive experiments across multiple model architectures (including LLaMA2, LLaMA3 and Qwen) and sizes (from 7B to 70B parameters) show that CLP significantly outperforms existing state-of-the-art baselines. For example, at a pruning rate of 20%, CLP achieves an average performance retention of 95.34% on LLaMA3-70B, outperforming baselines by 4.29%-30.52%. Furthermore, CLP can be seamlessly combined with quantization to further compress the model with only a slight performance loss.
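As a rough illustration of the differentiable gating idea described in the abstract (not the paper's actual algorithm), a contiguous pruning window can be parameterized by a soft "bump" mask over layer indices whose center is learned by gradient descent against per-layer importance scores. All function names, the fixed importance vector, and the finite-difference optimizer below are illustrative assumptions:

```python
import numpy as np

def bump_mask(n_layers, center, width, sharpness=4.0):
    """Soft mask over layer indices: ~0 inside the contiguous window
    [center - width/2, center + width/2] (pruned), ~1 outside (kept)."""
    i = np.arange(n_layers)
    left = 1.0 / (1.0 + np.exp(-sharpness * (i - (center - width / 2))))
    right = 1.0 / (1.0 + np.exp(-sharpness * ((center + width / 2) - i)))
    return 1.0 - left * right

def pruned_importance(importance, center, width):
    """Total importance mass falling inside the soft pruning window."""
    mask = bump_mask(len(importance), center, width)
    return float(np.sum(importance * (1.0 - mask)))

def find_prune_window(importance, width, lr=0.5, steps=300, eps=1e-3):
    """Slide the window center by (finite-difference) gradient descent so the
    pruned segment covers the least important contiguous run of layers."""
    c = len(importance) / 2.0
    for _ in range(steps):
        g = (pruned_importance(importance, c + eps, width)
             - pruned_importance(importance, c - eps, width)) / (2 * eps)
        c = float(np.clip(c - lr * g, width / 2, len(importance) - width / 2))
    return c

# Toy example: a 32-layer model whose layers 10-15 are nearly redundant.
importance = np.ones(32)
importance[10:16] = 0.1
center = find_prune_window(importance, width=6)
segment = [i for i in range(32) if bump_mask(32, center, 6)[i] < 0.5]
```

In this toy run the window settles over the low-importance block; the real method optimizes the gate jointly with the task loss end to end rather than against a fixed importance vector.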
Problem

Research questions and friction points this paper is trying to address.

Automated pruning of contiguous layers in large language models
Addressing layer dependencies ignored by existing pruning methods
Reducing computational costs while maintaining high model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Differentiable concave gate algorithm identifies pruning segments
Cutoff endpoint tuning strategy restores model performance
Automated contiguous layer pruning framework for large models
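The structural side of the two innovations above can be sketched in a few lines: drop a contiguous segment of transformer blocks, then mark only the blocks adjacent to the cut for fine-tuning. The helper name and list-based "model" are hypothetical; the paper's actual tuning procedure is richer.

```python
def prune_contiguous(layers, start, end):
    """Drop the contiguous segment layers[start:end] and report which positions
    in the *pruned* model sit at the cutoff endpoints (candidates for tuning)."""
    assert 0 <= start < end <= len(layers)
    kept = layers[:start] + layers[end:]
    endpoints = []
    if start > 0:
        endpoints.append(start - 1)  # last block before the cut
    if start < len(kept):
        endpoints.append(start)      # first block after the cut
    return kept, endpoints

# An 8-block toy "model": prune blocks 3-5, then fine-tune only the two
# blocks adjoining the cut to re-stitch information flow across it.
blocks = [f"block{i}" for i in range(8)]
pruned, tune_idx = prune_contiguous(blocks, 3, 6)
```

Restricting fine-tuning to the cutoff endpoints is what keeps the restoration step cheap relative to full-model retraining.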
Yao Lu
Institute of Cyberspace Security, College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China; also with the Binjiang Institute of Artificial Intelligence, Zhejiang University of Technology, Hangzhou 310056, China; and with the Centre for Frontier AI Research, Agency for Science, Technology and Research, Singapore 138632
Yuqi Li
City College of New York, City University of New York, USA
Wenbin Xie
Institute of Cyberspace Security, College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China, also with the Binjiang Institute of Artificial Intelligence, Zhejiang University of Technology, Hangzhou 310056, China
Shanqing Yu
Institute of Cyberspace Security, College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China, also with the Binjiang Institute of Artificial Intelligence, Zhejiang University of Technology, Hangzhou 310056, China
Qi Xuan
Professor, Zhejiang University of Technology
AI Security, Social Network, Deep Learning, Data Mining
Zhaowei Zhu
Docta.ai; University of California, Santa Cruz
Machine Learning, Data Quality, Label Noise, Responsible AI
Shiping Wen
Professor, FInstP, FBCS, University of Technology Sydney
Neural Network, Memristor, Machine Learning, Safety-critical Control