Merging Feed-Forward Sublayers for Compressed Transformers

📅 2025-01-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
To ease the tension between limited device memory and performance preservation when deploying large models, this paper proposes a compression method that exploits functional redundancy among Transformer feed-forward sublayers, departing from conventional pruning paradigms. The core contribution is a systematic alignment and fusion of feed-forward sublayers that exhibit high activation similarity, achieved via a joint weight-activation similarity metric for sublayer alignment and a weighted fusion strategy. The approach generalizes across diverse tasks, including language modeling, image classification, and machine translation. On a Vision Transformer, it removes over 21% of parameters while retaining 99% of the original accuracy, and fusing more than a third of the feed-forward sublayers yields accuracy on par with the full model, outperforming a strong pruning baseline. This work points toward efficient model compression through exploiting functional redundancy rather than structural sparsification.

📝 Abstract
With the rise and ubiquity of larger deep learning models, the need for high-quality compression techniques is growing in order to deploy these models widely. The sheer parameter count of these models makes it difficult to fit them into the memory constraints of different hardware. In this work, we present a novel approach to model compression by merging similar parameter groups within a model, rather than pruning away less important parameters. Specifically, we select, align, and merge separate feed-forward sublayers in Transformer models, and test our method on language modeling, image classification, and machine translation. With our method, we demonstrate performance comparable to the original models while combining more than a third of model feed-forward sublayers, and demonstrate improved performance over a strong layer-pruning baseline. For instance, we can remove over 21% of total parameters from a Vision Transformer, while maintaining 99% of its original performance. Additionally, we observe that some groups of feed-forward sublayers exhibit high activation similarity, which may help explain their surprising mergeability.
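The align-and-merge idea from the abstract can be sketched in code. The snippet below is a minimal illustration, not the paper's exact method: it assumes a standard two-matrix feed-forward sublayer, aligns the hidden units of one sublayer to another by greedily matching activation correlations on a small calibration set, and then averages the aligned weights. The function name `align_and_merge_ffn` and the greedy matching strategy are this sketch's own choices.

```python
import numpy as np

def align_and_merge_ffn(W1_a, W2_a, W1_b, W2_b, acts_a, acts_b):
    """Merge two feed-forward sublayers into one shared sublayer.

    W1_*: (d_model, d_ff) input projections; W2_*: (d_ff, d_model) output
    projections. acts_*: (n_samples, d_ff) hidden activations recorded on a
    small calibration set. Hidden units of sublayer b are permuted to best
    match those of sublayer a before the two weight sets are averaged.
    """
    # Normalize activations, then correlate every hidden unit of a
    # with every hidden unit of b.
    a = (acts_a - acts_a.mean(0)) / (acts_a.std(0) + 1e-8)
    b = (acts_b - acts_b.mean(0)) / (acts_b.std(0) + 1e-8)
    corr = a.T @ b / len(a)                      # (d_ff, d_ff)

    # Greedy one-to-one matching: unit i of a gets the best unused unit of b.
    d_ff = corr.shape[0]
    perm = np.empty(d_ff, dtype=int)
    taken = np.zeros(d_ff, dtype=bool)
    for i in range(d_ff):
        perm[i] = int(np.argmax(np.where(taken, -np.inf, corr[i])))
        taken[perm[i]] = True

    # Permute b's hidden dimension, then average the two sublayers.
    W1 = 0.5 * (W1_a + W1_b[:, perm])
    W2 = 0.5 * (W2_a + W2_b[perm, :])
    return W1, W2
```

If sublayer b is an exact permuted copy of sublayer a, the alignment undoes the permutation and the merged weights equal the originals; in practice the merged sublayer replaces both, which is what removes parameters.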

Problem

Research questions and friction points this paper is trying to address.

Model Compression
Deep Learning
Memory Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Model Compression
Deep Learning
Component Merging