Streamlining Redundant Layers to Compress Large Language Models

📅 2024-03-28
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address the inefficiency caused by layer redundancy in large language models (LLMs), this paper proposes a novel layer-pruning paradigm: rank layers by how strongly they perturb the hidden states, identify and remove consecutive redundant layers whose removal has negligible impact, and replace them with a lightweight, learnable module. Crucially, the authors introduce a "stability" metric, which quantifies the distributional shift in hidden states before and after compression, to complement conventional accuracy-based evaluation and better preserve representation fidelity. The method integrates importance-driven consecutive-layer pruning, distillation-based modular replacement, and explicit stability modeling. Extensive experiments across multiple benchmark tasks show that the approach significantly outperforms existing pruning methods: it reduces parameter count by 20–40% while maintaining or even improving downstream task performance, and accelerates training by over 1.8×.
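The core pruning step can be illustrated with a minimal sketch. Here, a layer's importance is approximated as how much it changes its hidden state (one minus the cosine similarity between the layer's input and output), and the least important window of consecutive layers is the one with the lowest summed importance. The function names and the exact importance measure are illustrative assumptions, not the paper's precise formulation:

```python
import numpy as np

def layer_importance(h_in: np.ndarray, h_out: np.ndarray) -> float:
    """Assumed importance proxy: 1 - cosine similarity between a layer's
    input and output hidden states (small change => low importance)."""
    num = float(np.dot(h_in.ravel(), h_out.ravel()))
    den = float(np.linalg.norm(h_in) * np.linalg.norm(h_out)) + 1e-12
    return 1.0 - num / den

def least_important_window(hidden_states: list, n_prune: int) -> int:
    """Return the start index of the n_prune consecutive layers whose
    summed importance is lowest. hidden_states[i] is the input to layer i
    and hidden_states[i + 1] is its output."""
    scores = [layer_importance(hidden_states[i], hidden_states[i + 1])
              for i in range(len(hidden_states) - 1)]
    windows = [sum(scores[s:s + n_prune])
               for s in range(len(scores) - n_prune + 1)]
    return int(np.argmin(windows))
```

In LLM-Streamline, the selected window would then be replaced by a single lightweight module trained (via distillation) to mimic the pruned layers' input-to-output mapping.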

📝 Abstract
This paper introduces LLM-Streamline, a pioneering work on layer pruning for large language models (LLMs). It is based on the observation that different layers have varying impacts on hidden states, enabling the identification of less important layers to be pruned. LLM-Streamline comprises two parts: layer pruning, which removes consecutive layers with the lowest importance based on a target sparsity, and layer replacement, a novel module that trains a lightweight network to replace the pruned layers and mitigate performance loss. Additionally, a new metric called stability is proposed to address the limitations of the widely used accuracy metric in evaluating model compression. Experiments show that LLM-Streamline outperforms both previous and concurrent state-of-the-art pruning methods in terms of both performance and training efficiency. Our code is available at https://github.com/RUCKBReasoning/LLM-Streamline
Problem

Research questions and friction points this paper is trying to address.

Language Model Compression
Performance Preservation
Training Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-Streamline
selective layer pruning
stability metric
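The stability metric is described as capturing the distributional shift in hidden states before and after compression. One plausible instantiation, sketched below under the assumption that stability is the mean per-sample cosine similarity between the original and compressed models' hidden states (the paper's exact definition may differ):

```python
import numpy as np

def stability(h_before: np.ndarray, h_after: np.ndarray) -> float:
    """Hypothetical stability score: mean per-sample cosine similarity
    between hidden states of the original and the compressed model.

    h_before, h_after: arrays of shape (num_samples, hidden_dim).
    Returns a value in [-1, 1]; values near 1 mean the compressed model's
    representations stayed close to the original's.
    """
    num = np.sum(h_before * h_after, axis=1)
    den = (np.linalg.norm(h_before, axis=1) *
           np.linalg.norm(h_after, axis=1)) + 1e-12
    return float(np.mean(num / den))
```

Unlike task accuracy, which can stay flat while individual predictions flip, a representation-level score like this directly penalizes drift in the model's internal states.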
Xiaodong Chen
Renmin University of China, China
Yuxuan Hu
Renmin University of China, China
Jing Zhang
Renmin University of China, China
Yanling Wang
Zhipu AI
Data Mining, Natural Language Processing
Cuiping Li
Renmin University of China
Database, big data analysis and mining
Hong Chen
Renmin University of China, China