🤖 AI Summary
To address severe parameter and computational redundancy in the MLP modules of large-scale vision Transformers, this paper proposes a diversity-guided MLP compression method. The approach introduces a novel Gram–Schmidt orthogonalization-based weight pruning strategy that explicitly preserves weight diversity while sparsifying hidden neurons. It further combines label-free, few-shot knowledge distillation with structural compression of the MLP hidden layers to jointly reduce parameters and FLOPs. On EVA-CLIP-E (4.4B), the method achieves a 71.5% reduction in both parameter count and FLOPs without any accuracy loss. Across multiple models, the average compression rate reaches 57.0%, and full recovery of the original performance requires only 0.06% of the LAION-2B dataset—effectively avoiding the notorious post-pruning accuracy collapse.
📝 Abstract
Transformer models exhibit excellent scaling properties: performance improves as model capacity increases. However, large-scale model parameters lead to an unaffordable cost in computation and memory. We analyze popular transformer architectures and find that multilayer perceptron (MLP) modules account for the majority of model parameters. To this end, we focus on the recoverability of compressed models and propose a Diversity-Guided MLP Reduction (DGMR) method that significantly reduces the parameters of large vision transformers with only negligible performance degradation. Specifically, we apply a Gram-Schmidt weight pruning strategy to eliminate redundant neurons in the MLP hidden layer while preserving weight diversity for better performance recovery during distillation. Compared to a model trained from scratch, our pruned model requires only 0.06% of the LAION-2B data (used for training large vision transformers) and no labels (ImageNet-1K) to recover the original performance. Experimental results on several state-of-the-art large vision transformers demonstrate that our method achieves more than 57.0% parameter and FLOPs reduction in a near-lossless manner. Notably, for EVA-CLIP-E (4.4B), our method accomplishes a 71.5% parameter and FLOPs reduction without performance degradation. The source code and trained weights are available at https://github.com/visresearch/DGMR.
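To make the core idea concrete, here is a minimal NumPy sketch of diversity-guided neuron selection via Gram-Schmidt orthogonalization. This is an illustrative reading of the abstract, not the paper's implementation: the function name `gram_schmidt_prune`, the greedy selection rule (keep the neuron whose weight vector has the largest residual norm after orthogonalizing against already-kept neurons), and the use of NumPy are all assumptions.

```python
import numpy as np

def gram_schmidt_prune(W, keep):
    """Greedily select `keep` rows of W (one row = one hidden neuron's
    input weights) that maximize diversity, via Gram-Schmidt.

    Illustrative sketch only; the selection rule is an assumption,
    not the paper's exact algorithm. Returns sorted kept indices.
    """
    R = W.astype(np.float64).copy()  # residuals of each neuron's weights
    kept = []
    for _ in range(keep):
        norms = np.linalg.norm(R, axis=1)
        if kept:
            norms[kept] = -np.inf  # exclude already-selected neurons
        i = int(np.argmax(norms))
        kept.append(i)
        # orthogonalize all residuals against the chosen direction,
        # so the next pick favors weights not spanned by kept neurons
        u = R[i] / np.linalg.norm(R[i])
        R = R - np.outer(R @ u, u)
    return sorted(kept)

# Toy example: rows 0 and 1 are duplicates, row 2 is orthogonal.
W = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])
print(gram_schmidt_prune(W, keep=2))  # → [0, 2]
```

In an actual MLP, the kept indices would slice both the first layer's rows and the second layer's corresponding columns, shrinking the hidden dimension; distillation from the uncompressed model would then recover accuracy.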