A Simple Linear Patch Revives Layer-Pruned Large Language Models

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the significant performance degradation of large language models (LLMs) after layer pruning, this paper proposes LinearPatch, a linear patching technique that jointly calibrates cross-layer and cross-token activation magnitudes at the pruned-layer interface. The method applies a Hadamard transform to suppress interference from outlier tokens and combines it with channel-wise scaling, fusing both operations into a single matrix for efficient alignment. LinearPatch is plug-and-play and adds negligible inference overhead. Pruning five layers from LLaMA-3-8B, it retains 94.15% of the original performance on question-answering tasks, surpassing the previous state of the art by roughly 4 percentage points; a further memory-efficient offline knowledge distillation step, requiring only 5K samples and 30 minutes on a single GPU, raises the retained performance to 95.16%.
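The core construction can be pictured as follows. This is a minimal, illustrative sketch based only on the summary and abstract, not the authors' released code: the function name `build_linear_patch`, the use of RMS calibration statistics, and the choice to rotate back to the original basis after scaling are assumptions.

```python
# Illustrative sketch (not the authors' code): fuse a Hadamard rotation and a
# channel-wise scaling into a single "patch" matrix bridging a pruning gap.
import torch
from scipy.linalg import hadamard

def build_linear_patch(acts_before: torch.Tensor, acts_after: torch.Tensor) -> torch.Tensor:
    """acts_before: calibration activations fed into the pruned block;
       acts_after:  activations the first retained layer originally received.
       Both are (num_tokens, hidden_size); hidden_size must be a power of 2."""
    d = acts_before.shape[-1]
    # Orthonormal Hadamard matrix: spreads per-token outliers across channels.
    H = torch.tensor(hadamard(d), dtype=acts_before.dtype) / d ** 0.5

    # Statistics in the rotated basis.
    rot_before = acts_before @ H
    rot_after = acts_after @ H

    # Channel-wise scaling that aligns RMS activation magnitudes across the gap
    # (the exact scaling rule is an assumption for illustration).
    scale = rot_after.pow(2).mean(0).sqrt() / rot_before.pow(2).mean(0).sqrt().clamp_min(1e-6)

    # Fuse rotate -> scale -> rotate-back into one matrix: x_patched = x @ P.
    P = H @ torch.diag(scale) @ H.T
    return P
```

Because the patch is a single hidden_size × hidden_size matrix, it could presumably be folded into an adjacent retained weight matrix, which would account for the negligible inference overhead the paper reports.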

📝 Abstract
Layer pruning has become a popular technique for compressing large language models (LLMs) due to its simplicity. However, existing layer pruning methods often suffer from significant performance drops. We identify that this degradation stems from the mismatch of activation magnitudes across layers and tokens at the pruning interface. To address this, we propose LinearPatch, a simple plug-and-play technique to revive the layer-pruned LLMs. The proposed method adopts Hadamard transformation to suppress massive outliers in particular tokens, and channel-wise scaling to align the activation magnitudes. These operations can be fused into a single matrix, which functions as a patch to bridge the pruning interface with negligible inference overhead. LinearPatch retains up to 94.15% performance of the original model when pruning 5 layers of LLaMA-3-8B on the question answering benchmark, surpassing existing state-of-the-art methods by 4%. In addition, the patch matrix can be further optimized with memory efficient offline knowledge distillation. With only 5K samples, the retained performance of LinearPatch can be further boosted to 95.16% within 30 minutes on a single computing card.
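The offline knowledge distillation step described at the end of the abstract could look roughly like the sketch below. This is a hedged illustration under strong assumptions: the helper name `distill_patch`, the MSE objective on cached teacher activations, the optimizer, and the batch size are all illustrative; the paper's actual objective and schedule may differ (e.g., it may propagate through the remaining layers rather than fitting cached targets directly).

```python
# Hedged sketch: optimize only the patch matrix against pre-computed teacher
# activations, so neither full model needs to stay in memory during training.
import torch

def distill_patch(P_init: torch.Tensor,
                  cached_inputs: torch.Tensor,   # activations entering the pruning gap (~5K samples)
                  cached_targets: torch.Tensor,  # original model's activations after the pruned layers
                  steps: int = 1000, lr: float = 1e-4) -> torch.Tensor:
    P = P_init.clone().float().requires_grad_(True)   # the patch is the only trainable parameter
    opt = torch.optim.AdamW([P], lr=lr)
    n = cached_inputs.shape[0]
    for _ in range(steps):
        idx = torch.randint(0, n, (64,))              # small random mini-batch
        pred = cached_inputs[idx].float() @ P         # patched activations
        loss = torch.nn.functional.mse_loss(pred, cached_targets[idx].float())
        opt.zero_grad()
        loss.backward()
        opt.step()
    return P.detach()
```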
Problem

Research questions and friction points this paper is trying to address.

Addresses performance drop in layer-pruned LLMs
Mitigates activation magnitude mismatch across layers
Enhances pruned model performance with minimal overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hadamard transformation suppresses activation outliers
Channel-wise scaling aligns activation magnitudes
Single matrix patch bridges pruning interface
Xinrui Chen
Tsinghua University
Efficient Deep Learning, Computer Vision
Haoli Bai
Huawei Technologies
Natural Language Processing, Model Compression
Tao Yuan
University of California, Los Angeles
Computer Vision, Artificial Intelligence
Ruikang Liu
Shenzhen International Graduate School, Tsinghua University
Kang Zhao
Huawei Noah’s Ark Lab
Xianzhi Yu
Unknown affiliation
AI, HPC
Lu Hou
Huawei Noah’s Ark Lab
Tian Guan
Shenzhen International Graduate School, Tsinghua University
Yonghong He
Shenzhen International Graduate School, Tsinghua University
Biomedical Engineering, Optical Imaging, AI Image Processing, Pathology Foundation Models
Chun Yuan
Shenzhen International Graduate School, Tsinghua University