A Simple Linear Patch Revives Layer-Pruned Large Language Models

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the significant performance degradation of large language models (LLMs) after layer pruning, this paper proposes LinearPatch, a linear patching technique that jointly calibrates cross-layer and cross-token activation magnitudes at the pruned-layer interface. The method applies a Hadamard transform to suppress interference from outlier tokens and combines it with channel-wise scaling, fusing both operations into a single matrix for efficient alignment. LinearPatch is plug-and-play and adds negligible inference overhead. Pruning five layers from LLaMA-3-8B, it retains 94.15% of the original performance on question-answering tasks, surpassing the previous state of the art by roughly 4 percentage points; a further memory-efficient offline knowledge distillation step, requiring only 5K samples and 30 minutes on a single GPU, raises the retained performance to 95.16%.
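The core construction can be pictured as follows. This is a minimal, illustrative sketch based only on the summary and abstract, not the authors' released code: the function name `build_linear_patch`, the use of RMS calibration statistics, and the choice to rotate back to the original basis after scaling are assumptions.

```python
# Illustrative sketch (not the authors' code): fuse a Hadamard rotation and a
# channel-wise scaling into a single "patch" matrix bridging a pruning gap.
import torch
from scipy.linalg import hadamard

def build_linear_patch(acts_before: torch.Tensor, acts_after: torch.Tensor) -> torch.Tensor:
    """acts_before: calibration activations fed into the pruned block;
       acts_after:  activations the first retained layer originally received.
       Both are (num_tokens, hidden_size); hidden_size must be a power of 2."""
    d = acts_before.shape[-1]
    # Orthonormal Hadamard matrix: spreads per-token outliers across channels.
    H = torch.tensor(hadamard(d), dtype=acts_before.dtype) / d ** 0.5

    # Statistics in the rotated basis.
    rot_before = acts_before @ H
    rot_after = acts_after @ H

    # Channel-wise scaling that aligns RMS activation magnitudes across the gap
    # (the exact scaling rule is an assumption for illustration).
    scale = rot_after.pow(2).mean(0).sqrt() / rot_before.pow(2).mean(0).sqrt().clamp_min(1e-6)

    # Fuse rotate -> scale -> rotate-back into one matrix: x_patched = x @ P.
    P = H @ torch.diag(scale) @ H.T
    return P
```

Because the patch is a single hidden_size × hidden_size matrix, it could presumably be folded into an adjacent retained weight matrix, which would account for the negligible inference overhead the paper reports.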

📝 Abstract
Layer pruning has become a popular technique for compressing large language models (LLMs) due to its simplicity. However, existing layer pruning methods often suffer from significant performance drops. We identify that this degradation stems from the mismatch of activation magnitudes across layers and tokens at the pruning interface. To address this, we propose LinearPatch, a simple plug-and-play technique to revive the layer-pruned LLMs. The proposed method adopts Hadamard transformation to suppress massive outliers in particular tokens, and channel-wise scaling to align the activation magnitudes. These operations can be fused into a single matrix, which functions as a patch to bridge the pruning interface with negligible inference overhead. LinearPatch retains up to 94.15% performance of the original model when pruning 5 layers of LLaMA-3-8B on the question answering benchmark, surpassing existing state-of-the-art methods by 4%. In addition, the patch matrix can be further optimized with memory efficient offline knowledge distillation. With only 5K samples, the retained performance of LinearPatch can be further boosted to 95.16% within 30 minutes on a single computing card.
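The offline knowledge distillation step described at the end of the abstract could look roughly like the sketch below. This is a hedged illustration under strong assumptions: the helper name `distill_patch`, the MSE objective on cached teacher activations, the optimizer, and the batch size are all illustrative; the paper's actual objective and schedule may differ (e.g., it may propagate through the remaining layers rather than fitting cached targets directly).

```python
# Hedged sketch: optimize only the patch matrix against pre-computed teacher
# activations, so neither full model needs to stay in memory during training.
import torch

def distill_patch(P_init: torch.Tensor,
                  cached_inputs: torch.Tensor,   # activations entering the pruning gap (~5K samples)
                  cached_targets: torch.Tensor,  # original model's activations after the pruned layers
                  steps: int = 1000, lr: float = 1e-4) -> torch.Tensor:
    P = P_init.clone().float().requires_grad_(True)   # the patch is the only trainable parameter
    opt = torch.optim.AdamW([P], lr=lr)
    n = cached_inputs.shape[0]
    for _ in range(steps):
        idx = torch.randint(0, n, (64,))              # small random mini-batch
        pred = cached_inputs[idx].float() @ P         # patched activations
        loss = torch.nn.functional.mse_loss(pred, cached_targets[idx].float())
        opt.zero_grad()
        loss.backward()
        opt.step()
    return P.detach()
```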
Problem

Research questions and friction points this paper is trying to address.

Addresses performance drop in layer-pruned LLMs
Mitigates activation magnitude mismatch across layers
Enhances pruned model performance with minimal overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hadamard transformation suppresses activation outliers
Channel-wise scaling aligns activation magnitudes
Single matrix patch bridges pruning interface
Xinrui Chen
Tsinghua University
Efficient Deep Learning, Computer Vision
Haoli Bai
Huawei Technologies
Natural Language Processing, Model Compression
Tao Yuan
University of California, Los Angeles
Computer Vision, Artificial Intelligence
Ruikang Liu
Shenzhen International Graduate School, Tsinghua University
Kang Zhao
Huawei Noah’s Ark Lab
Xianzhi Yu
Unknown affiliation
AI, HPC
Lu Hou
Huawei Noah’s Ark Lab
Tian Guan
Shenzhen International Graduate School, Tsinghua University
Yonghong He
Shenzhen International Graduate School, Tsinghua University
Biomedical Engineering, Optical Imaging, AI Image Processing, Pathology Foundation Models
Chun Yuan
Shenzhen International Graduate School, Tsinghua University