Demystifying the Compression of Mixture-of-Experts Through a Unified Framework

📅 2024-06-04
🏛️ arXiv.org
📈 Citations: 10
Influential: 1
📄 PDF
🤖 AI Summary
To address the efficiency bottlenecks of deploying large MoE models, which stem from parameter redundancy and communication overhead, this paper proposes a unified compression framework for MoE architectures built on two complementary pathways: Expert Slimming, which compresses individual experts, and Expert Trimming, which removes structured modules. The framework additionally introduces aggressive Expert Trimming techniques, Layer Drop and Block Drop, to eliminate redundancy at larger scales, and distills the findings into a practical compression recipe. On Mixtral-8x7B, the resulting approach achieves a 6.05x inference speedup and a 20.0 GB GPU memory footprint while retaining over 92% of performance on downstream tasks.

📝 Abstract
Scaling large language models has revolutionized performance across diverse domains, yet the continual growth in model size poses significant challenges for real-world deployment. The Mixture of Experts (MoE) approach addresses this by dynamically selecting and activating only a subset of experts, significantly reducing computational costs while maintaining high performance. However, MoE introduces potential redundancy (e.g., parameters) and extra costs (e.g., communication overhead). Despite numerous compression techniques developed for mitigating the redundancy in dense models, the compression of MoE remains under-explored. We first bridge this gap with a cutting-edge unified framework that not only seamlessly integrates mainstream compression methods but also helps systematically understand MoE compression. This framework approaches compression from two perspectives: Expert Slimming, which compresses individual experts, and Expert Trimming, which removes structured modules. Within this framework, we explore the optimization space unexplored by existing methods, and further introduce aggressive Expert Trimming techniques, i.e., Layer Drop and Block Drop, to eliminate redundancy at larger scales. Based on these insights, we present a comprehensive recipe to guide practitioners in compressing MoE effectively. Extensive experimental results demonstrate the effectiveness of the compression methods under our framework and the proposed recipe, achieving a 6.05x speedup and only 20.0GB memory usage while maintaining over 92% of performance on Mixtral-8x7B. Code is released at https://github.com/DaizeDong/Unified-MoE-Compression.
Problem

Research questions and friction points this paper is trying to address.

MoE architectures introduce parameter redundancy and extra communication overhead at deployment time.
Compression is well studied for dense models, but MoE-specific compression remains under-explored.
The goal is to reduce computational and memory costs while maintaining high model performance.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Layer Drop removes entire MoE layers.
Block Drop eliminates whole transformer blocks.
Expert Slimming compresses individual experts.
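The two Expert Trimming operations above can be illustrated with a minimal sketch. This is a toy model of a transformer as a list of blocks, not the authors' implementation; the `Block`, `layer_drop`, and `block_drop` names and the index-based selection are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Block:
    """A toy transformer block: an attention sublayer plus an optional MoE sublayer."""
    attn: str
    moe: Optional[str]  # None once the MoE layer has been dropped


def layer_drop(blocks: List[Block], drop_idx: List[int]) -> List[Block]:
    """Layer Drop: remove the MoE sublayer from the selected blocks.

    The attention sublayer is kept, so the block degenerates to
    attention followed by an identity mapping.
    """
    dropped = set(drop_idx)
    return [
        Block(b.attn, None if i in dropped else b.moe)
        for i, b in enumerate(blocks)
    ]


def block_drop(blocks: List[Block], drop_idx: List[int]) -> List[Block]:
    """Block Drop: remove entire transformer blocks (attention and MoE together)."""
    dropped = set(drop_idx)
    return [b for i, b in enumerate(blocks) if i not in dropped]


model = [Block(f"attn{i}", f"moe{i}") for i in range(4)]
slimmed = layer_drop(model, [1, 3])  # still 4 blocks, but 2 MoE layers removed
shortened = block_drop(model, [2])   # 3 blocks remain
```

In the paper's setting, which blocks or layers to drop would be chosen by redundancy measurements rather than fixed indices; the sketch only shows the structural effect of each operation, and why Block Drop is the more aggressive of the two.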