FlattenGPT: Depth Compression for Transformer with Layer Flattening

📅 2026-02-09
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses critical limitations in existing Transformer compression techniques, which either discard essential information through block-level pruning or, in the case of channel pruning, fail to reduce model depth and suffer from inconsistent channel-wise sparsity across layers. To overcome these issues, the authors propose a "layer flattening" mechanism that merges every two adjacent Transformer blocks into one, effectively compressing model depth while preserving architectural consistency. By integrating parameter redundancy detection with structured pruning, the method achieves a 20% reduction in depth on LLaMA-2/3 and Qwen-1.5 models while retaining 90–96% of zero-shot performance. The compressed models outperform prior pruning approaches in both zero-shot accuracy and WikiText-2 perplexity, demonstrating significant gains in inference efficiency for large language models.

๐Ÿ“ Abstract
Recent works have revealed redundancy across transformer blocks, prompting research on depth compression that prunes less crucial blocks. However, current entire-block pruning methods risk discarding meaningful cues learned in those blocks, leading to substantial performance degradation. Channel pruning, another line of model compression, better preserves performance, but it cannot reduce model depth and is challenged by inconsistent pruning ratios across individual layers. To pursue better model compression and acceleration, this paper proposes FlattenGPT, a novel way to detect and reduce depth-wise redundancies. By flattening two adjacent blocks into one, it compresses the network depth while enabling more effective detection and removal of parameter redundancy. FlattenGPT preserves the knowledge learned in all blocks and remains consistent with the original transformer architecture. Extensive experiments demonstrate that FlattenGPT improves model efficiency with only a modest performance trade-off. It outperforms existing pruning methods in both zero-shot accuracy and WikiText-2 perplexity across various model types and parameter sizes. On LLaMA-2/3 and Qwen-1.5 models, FlattenGPT retains 90–96% of zero-shot performance at a compression ratio of 20%. It also outperforms other pruning methods in accelerating LLM inference, making it promising for enhancing the efficiency of transformers.
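To build intuition for the "flattening" idea, the toy sketch below shows the simplest case: two stacked linear layers compose exactly into one layer, halving depth while preserving the mapping. This is only an illustration of the general principle, not FlattenGPT's actual algorithm — real transformer blocks contain nonlinearities and attention, so the paper's merging and redundancy-removal procedure is more involved. All variable names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 8
W1 = rng.standard_normal((d, d))  # weights of block i   (hypothetical toy "block")
W2 = rng.standard_normal((d, d))  # weights of block i+1 (hypothetical toy "block")
x = rng.standard_normal(d)

# Two-layer (deep) forward pass: x -> W1 x -> W2 (W1 x)
deep_out = W2 @ (W1 @ x)

# "Flattened" single layer: precompute the composed weight once, offline.
W_flat = W2 @ W1
flat_out = W_flat @ x

# The merged layer reproduces the two-layer output exactly, with one
# matrix multiply instead of two at inference time (depth halved).
assert np.allclose(deep_out, flat_out)
```

For purely linear maps the merge is lossless; the interesting part of FlattenGPT, per the abstract, is doing a depth-halving merge on full (nonlinear) transformer blocks while keeping the architecture unchanged and exposing parameter redundancy that structured pruning can then remove.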
Problem

Research questions and friction points this paper is trying to address.

depth compression
transformer redundancy
model pruning
performance degradation
layer depth reduction
Innovation

Methods, ideas, or system contributions that make the work stand out.

depth compression
layer flattening
transformer pruning
model acceleration
redundancy removal
Ruihan Xu
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Qingpei Guo
Ant Group
Multimodal LLMs, Vision-Language Models
Yao Zhu
Tsinghua University
Xiangyang Ji
Tsinghua University
Ming Yang
Ant Group
Shiliang Zhang
Department of Computer Science, School of EECS, Peking University
Multimedia Information Retrieval, Multimedia Systems, Visual Search