🤖 AI Summary
MoE-LLMs suffer from high memory overhead and inference latency due to redundant expert parameters. To address this, we propose MC, a training-free hybrid compression framework that jointly models expert importance and token importance for co-optimized storage and inference efficiency. Methodologically, MC (1) couples expert-level importance estimation with token-level dynamic routing; (2) integrates pre-loading mixed-precision quantization, formulated as a linear programming problem, with online dynamic pruning; and (3) drives these components with LP-based adaptive bit-width allocation, importance scoring based on routing probabilities and activation frequency, and token-aware dynamic expert selection. MC compresses 76.6% of parameters to 2.54 bits per parameter while incurring only a 3.8% average accuracy drop. During inference, it further reduces activated parameters by 15% with negligible performance degradation (<0.6%). This is the first work to unify expert- and token-level importance modeling for efficient MoE-LLM deployment without fine-tuning.
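The token-aware dynamic expert selection described above can be sketched as follows. This is a minimal illustration, not the paper's exact method: the importance proxy (each token's top routing probability), the `important_frac` split, and the fallback to a single expert for unimportant tokens are all assumptions made here for concreteness.

```python
import numpy as np

def dynamic_expert_selection(router_probs, important_frac=0.5, full_k=2):
    """Token-aware dynamic expert pruning (illustrative heuristic).

    router_probs: (T, E) softmax routing probabilities per token.
    Tokens are ranked by a simple importance proxy (their maximum routing
    probability); the top `important_frac` fraction keeps the full
    top-`full_k` experts, while the rest fall back to a single expert.
    The paper's actual token-importance criterion may differ.
    """
    T, E = router_probs.shape
    # Importance proxy: confidence of the strongest expert per token.
    scores = router_probs.max(axis=1)
    n_important = max(1, int(important_frac * T))
    important = np.argsort(scores)[::-1][:n_important]
    # Default: every token routes only to its single best expert.
    k_per_token = np.ones(T, dtype=int)
    k_per_token[important] = full_k
    # Expert indices per token, sorted by descending routing probability.
    order = np.argsort(router_probs, axis=1)[:, ::-1]
    return [order[t, :k_per_token[t]].tolist() for t in range(T)]
```

Because unimportant tokens activate fewer experts, the average number of activated parameters per forward pass drops, which is the source of the ~15% reduction reported above.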
📝 Abstract
Mixture-of-Experts large language models (MoE-LLMs) mark a significant step forward for language models; however, they face two critical challenges in practice: 1) expert parameters lead to considerable memory consumption and loading latency; and 2) the currently activated experts are redundant, as many tokens may require only a single expert. Motivated by these issues, we investigate MoE-LLMs and make two key observations: a) different experts exhibit varying behaviors in activation reconstruction error, routing scores, and activation frequencies, highlighting their differing importance; and b) not all tokens are equally important -- only a small subset is critical. Building on these insights, we propose MC, a training-free Mixture-Compressor for MoE-LLMs that leverages the significance of both experts and tokens to achieve extreme compression. First, to mitigate storage and loading overheads, we introduce Pre-Loading Mixed-Precision Quantization, which formulates adaptive bit-width allocation as a Linear Programming problem whose objective balances multiple factors reflecting the importance of each expert. Second, we develop Online Dynamic Pruning, which identifies important tokens to retain and dynamically selects activated experts for the remaining tokens during inference, optimizing efficiency while maintaining performance. Our MC integrates static quantization and dynamic pruning to collaboratively achieve extreme compression of MoE-LLMs with minimal accuracy loss, ensuring an optimal trade-off between performance and efficiency. Extensive experiments confirm the effectiveness of our approach. For instance, at 2.54 bits, MC compresses 76.6% of the model with only a 3.8% average accuracy loss. During dynamic inference, we further reduce activated parameters by 15%, with a performance drop of less than 0.6%.
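To make the LP formulation of adaptive bit-width allocation concrete, here is a minimal sketch using a standard LP relaxation. The cost model (`importance / 2**bits` as a quantization-error proxy), the candidate bit-widths, and the argmax rounding are all assumptions for illustration; the paper's objective combines several expert-importance factors.

```python
import numpy as np
from scipy.optimize import linprog

def allocate_bits(importance, bit_choices, avg_bit_budget):
    """LP relaxation of per-expert bit-width allocation (sketch).

    Variables x[e, b] in [0, 1] indicate that expert e is quantized to
    bit_choices[b] bits. The objective penalizes giving few bits to
    important experts via the proxy cost importance[e] / 2**bits; the
    coupling constraint keeps the average bit-width within budget.
    """
    E, B = len(importance), len(bit_choices)
    # Objective: minimize importance-weighted quantization-error proxy.
    c = np.array([[imp / (2.0 ** b) for b in bit_choices]
                  for imp in importance]).ravel()
    # Equality: each expert selects exactly one bit-width.
    A_eq = np.zeros((E, E * B))
    for e in range(E):
        A_eq[e, e * B:(e + 1) * B] = 1.0
    b_eq = np.ones(E)
    # Inequality: total bits across experts stay within the budget.
    A_ub = np.array([bit_choices[i % B] for i in range(E * B)],
                    dtype=float).reshape(1, -1)
    b_ub = np.array([avg_bit_budget * E])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0.0, 1.0))
    x = res.x.reshape(E, B)
    # Round the relaxation: each expert takes its highest-weight choice.
    return [bit_choices[j] for j in x.argmax(axis=1)]
```

Because the per-expert marginal error savings decrease as bits increase, this LP behaves like a fractional knapsack and its optimum is (near-)integral, so the rounding step rarely violates the budget in practice.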