Layer as Puzzle Pieces: Compressing Large Language Models through Layer Concatenation

📅 2025-10-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) face severe computational and memory bottlenecks due to their enormous parameter counts, and existing structured pruning methods often incur substantial performance degradation, weak inter-layer information aggregation, and a lack of effective recovery mechanisms. To address these issues, the paper proposes a compression framework with two components: a channel sensitivity metric, built from activation intensity and weight norms, identifies critical channels for cross-layer concatenative fusion, and a layer-correspondence-driven hierarchical knowledge distillation mechanism enables progressive capability recovery after pruning. The approach integrates channel selection, weight concatenation, structured pruning, and joint analysis of activation and weight norms. Evaluated on seven benchmarks, it significantly outperforms state-of-the-art pruning methods: when pruning 30% of LLaMA-2-7B's parameters, the compressed model retains 83% of its original average accuracy while preserving strong generative capability.

📝 Abstract
Large Language Models excel at natural language processing tasks, but their massive size leads to high computational and storage demands. Recent works have sought to reduce model size through layer-wise structured pruning. However, they tend to overlook preserving the capabilities of the pruned parts. In this work, we re-examine structured pruning paradigms and uncover several key limitations: 1) notable performance degradation due to direct layer removal, 2) ineffective aggregation of linear weight layers, and 3) the lack of effective post-training recovery mechanisms. To address these limitations, we propose CoMe, comprising a progressive layer pruning framework with a Concatenation-based Merging technique and a hierarchical distillation post-training process. Specifically, we introduce a channel sensitivity metric that utilizes activation intensity and weight norms for fine-grained channel selection. Subsequently, we employ a concatenation-based layer merging method to fuse the most critical channels across adjacent layers, enabling progressive model size reduction. Finally, we propose a hierarchical distillation protocol that leverages the correspondences between the original and pruned model layers established during pruning, thereby enabling efficient knowledge transfer. Experiments on seven benchmarks show that CoMe achieves state-of-the-art performance; when pruning 30% of LLaMA-2-7B's parameters, the pruned model retains 83% of its original average accuracy. Our code is available at https://github.com/MPI-Lab/CoMe.
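The abstract describes a channel sensitivity metric that combines activation intensity with weight norms to rank channels for pruning. The paper's exact formula is not given here, so the sketch below is an illustrative assumption: it scores each input channel by the product of its activation norm over a calibration batch and the norm of the corresponding weight column.

```python
import numpy as np

def channel_sensitivity(activations, weight):
    """Score input channels by activation intensity times weight-column norm.

    activations: (n_tokens, d_in) sampled hidden states feeding the layer
    weight:      (d_out, d_in) linear layer weight matrix
    NOTE: this product form is a hypothetical stand-in for CoMe's metric,
    which the abstract only describes qualitatively.
    """
    act_intensity = np.linalg.norm(activations, axis=0)  # per-channel activation norm, shape (d_in,)
    weight_norm = np.linalg.norm(weight, axis=0)         # per-channel weight column norm, shape (d_in,)
    return act_intensity * weight_norm

# Toy usage: rank channels of a 16-wide layer from 128 calibration tokens.
rng = np.random.default_rng(0)
acts = rng.normal(size=(128, 16))
w = rng.normal(size=(32, 16))
scores = channel_sensitivity(acts, w)
top_channels = np.argsort(scores)[::-1][:8]  # keep the 8 most critical channels
```

In practice the calibration activations would come from forwarding a small text sample through the model; the random arrays here only demonstrate the shapes involved.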
Problem

Research questions and friction points this paper is trying to address.

Reducing computational and storage demands of large language models
Addressing performance degradation from direct layer removal
Improving layer merging and post-training recovery mechanisms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressive layer pruning with channel sensitivity metric
Concatenation-based merging of critical adjacent layers
Hierarchical distillation using original-pruned layer correspondences
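The concatenation-based merging idea above can be sketched as fusing two adjacent layers into one by keeping only their highest-scoring output channels. This is a minimal sketch under stated assumptions: the split of kept channels between the two layers and the row-wise selection are hypothetical simplifications, and the real method additionally handles activations, progressive pruning, and distillation.

```python
import numpy as np

def merge_adjacent_layers(w_a, w_b, scores_a, scores_b, keep):
    """Fuse two adjacent (d_out, d_in) weight matrices into one layer by
    concatenating their highest-scoring output channels (rows).

    scores_a/scores_b: per-output-channel sensitivity scores, shape (d_out,)
    keep: total number of output channels in the merged layer
    NOTE: the even split between layers is an illustrative assumption.
    """
    k_a = keep // 2
    k_b = keep - k_a
    rows_a = np.argsort(scores_a)[::-1][:k_a]  # most critical channels of layer A
    rows_b = np.argsort(scores_b)[::-1][:k_b]  # most critical channels of layer B
    return np.concatenate([w_a[rows_a], w_b[rows_b]], axis=0)

# Toy usage: two 32x16 layers collapse into a single 32x16 layer,
# halving the parameter count for this pair.
rng = np.random.default_rng(1)
w1, w2 = rng.normal(size=(32, 16)), rng.normal(size=(32, 16))
s1, s2 = rng.random(32), rng.random(32)
merged = merge_adjacent_layers(w1, w2, s1, s2, keep=32)
```

Because the merged layer's rows correspond to identifiable rows of the original layers, this mapping is exactly what a hierarchical distillation step could exploit when aligning teacher and student layers.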
Fei Wang
South China University of Technology
Li Shen
Shenzhen Campus of Sun Yat-sen University
Liang Ding
University of Sydney
Chao Xue
Beihang University
Natural Language Processing · Large Language Model
Ye Liu
South China University of Technology
Changxing Ding
Professor, South China University of Technology
Computer Vision · Embodied AI