🤖 AI Summary
To address severe parameter and computational redundancy in the MLP modules of large-scale vision Transformers, this paper proposes a diversity-guided MLP compression method. The approach introduces a novel Gram–Schmidt orthogonalization-based weight pruning strategy that explicitly preserves weight diversity while sparsifying hidden neurons. It further combines label-free, few-shot knowledge distillation with structural compression of the MLP hidden layers to jointly reduce parameters and FLOPs. On EVA-CLIP-E (4.4B), the method achieves a 71.5% reduction in both parameter count and FLOPs without any accuracy loss. Across multiple models, the average compression rate reaches 57.0%, and full recovery of the original performance requires only 0.06% of the LAION-2B dataset—effectively avoiding the notorious post-pruning accuracy collapse.
📝 Abstract
Transformer models exhibit excellent scaling properties: performance improves as model capacity increases. However, large-scale model parameters lead to an unaffordable cost in computation and memory. We analyze popular transformer architectures and find that multilayer perceptron (MLP) modules account for the majority of model parameters. To this end, we focus on the recoverability of compressed models and propose a Diversity-Guided MLP Reduction (DGMR) method that significantly reduces the parameters of large vision transformers with only negligible performance degradation. Specifically, we apply a Gram-Schmidt weight pruning strategy to eliminate redundant neurons in the MLP hidden layer while preserving weight diversity for better performance recovery during distillation. Compared to a model trained from scratch, our pruned model requires only 0.06% of the LAION-2B data (used for training large vision transformers) and no labels (ImageNet-1K) to recover the original performance. Experimental results on several state-of-the-art large vision transformers demonstrate that our method achieves more than 57.0% parameter and FLOPs reduction in a near-lossless manner. Notably, for EVA-CLIP-E (4.4B), our method accomplishes a 71.5% parameter and FLOPs reduction without performance degradation. The source code and trained weights are available at https://github.com/visresearch/DGMR.
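To make the core idea concrete, here is a minimal NumPy sketch of diversity-guided neuron selection via Gram-Schmidt orthogonalization. This is an illustrative reading of the abstract, not the paper's implementation: the function name `gram_schmidt_prune`, the greedy selection rule (keep the neuron whose weight vector has the largest residual norm after orthogonalizing against already-kept neurons), and the use of NumPy are all assumptions.

```python
import numpy as np

def gram_schmidt_prune(W, keep):
    """Greedily select `keep` rows of W (one row = one hidden neuron's
    input weights) that maximize diversity, via Gram-Schmidt.

    Illustrative sketch only; the selection rule is an assumption,
    not the paper's exact algorithm. Returns sorted kept indices.
    """
    R = W.astype(np.float64).copy()  # residuals of each neuron's weights
    kept = []
    for _ in range(keep):
        norms = np.linalg.norm(R, axis=1)
        if kept:
            norms[kept] = -np.inf  # exclude already-selected neurons
        i = int(np.argmax(norms))
        kept.append(i)
        # orthogonalize all residuals against the chosen direction,
        # so the next pick favors weights not spanned by kept neurons
        u = R[i] / np.linalg.norm(R[i])
        R = R - np.outer(R @ u, u)
    return sorted(kept)

# Toy example: rows 0 and 1 are duplicates, row 2 is orthogonal.
W = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])
print(gram_schmidt_prune(W, keep=2))  # → [0, 2]
```

In an actual MLP, the kept indices would slice both the first layer's rows and the second layer's corresponding columns, shrinking the hidden dimension; distillation from the uncompressed model would then recover accuracy.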