ResMoE: Space-efficient Compression of Mixture of Experts LLMs via Residual Restoration

📅 2025-03-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
MoE Transformers incur substantial GPU memory overhead during inference because all parameters must be loaded even though only a few experts are activated per token, resulting in low space efficiency. To address this, the authors propose a one-shot compression framework that requires no fine-tuning and is independent of training data. The method uses the Wasserstein barycenter to extract structural commonalities across experts and explicitly models the residual components to achieve a high-fidelity approximation. The work is the first to introduce the Wasserstein barycenter into MoE compression, cleanly decoupling shared structure from expert-specific residuals and thereby sidestepping the adaptation bottlenecks that conventional pruning and quantization face on inherently sparse MoE architectures. Evaluated on Switch Transformer, Mixtral, and DeepSeekMoE, the approach reduces per-expert parameter count by up to 75% while preserving near-original inference accuracy, significantly improving GPU memory utilization and practical deployment feasibility.

📝 Abstract
Mixture-of-Experts (MoE) Transformer, the backbone architecture of multiple phenomenal language models, leverages sparsity by activating only a fraction of model parameters for each input token. The sparse structure, while allowing constant time costs, results in space inefficiency: we still need to load all the model parameters during inference. We introduce ResMoE, an innovative MoE approximation framework that utilizes Wasserstein barycenter to extract a common expert (barycenter expert) and approximate the residuals between this barycenter expert and the original ones. ResMoE enhances the space efficiency for inference of large-scale MoE Transformers in a one-shot and data-agnostic manner without retraining while maintaining minimal accuracy loss, thereby paving the way for broader accessibility to large language models. We demonstrate the effectiveness of ResMoE through extensive experiments on Switch Transformer, Mixtral, and DeepSeekMoE models. The results show that ResMoE can reduce the number of parameters in an expert by up to 75% while maintaining comparable performance. The code is available at https://github.com/iDEA-iSAIL-Lab-UIUC/ResMoE.
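The decomposition described above (a shared barycenter expert plus compressed expert-specific residuals) can be sketched as follows. This is an illustrative approximation only: the plain Euclidean mean stands in for the paper's Wasserstein barycenter, and truncated SVD stands in for the paper's residual approximation; the function names are hypothetical.

```python
import numpy as np

def compress_experts(experts, rank):
    """Split experts into one shared matrix plus low-rank residuals.

    NOTE: a Euclidean mean is used here as a simple stand-in for the
    Wasserstein barycenter the paper actually computes.
    """
    barycenter = np.mean(experts, axis=0)          # shared "barycenter" expert
    compressed = []
    for W in experts:
        R = W - barycenter                         # expert-specific residual
        U, s, Vt = np.linalg.svd(R, full_matrices=False)
        # Keep only the top-`rank` singular directions of the residual.
        compressed.append((U[:, :rank] * s[:rank], Vt[:rank, :]))
    return barycenter, compressed

def restore_expert(barycenter, factors):
    """Residual restoration: shared part plus the low-rank residual."""
    US, Vt = factors
    return barycenter + US @ Vt
```

For a d-by-d expert stored at rank r, per-expert storage drops from d² to 2·d·r values (plus one shared barycenter per layer), which is where a roughly 75% reduction can come from when r is small relative to d.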
Problem

Research questions and friction points this paper is trying to address.

MoE inference must load all expert parameters despite sparse activation, yielding poor space efficiency
How to shrink expert parameters without retraining or access to training data
How to compress aggressively while keeping accuracy loss minimal
Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilizes Wasserstein barycenter for expert extraction
Approximates residuals to enhance space efficiency
Reduces per-expert parameters by up to 75% with minimal accuracy loss