🤖 AI Summary
MoE-LLMs suffer from high memory overhead and inference latency due to redundant expert parameters. To address this, we propose MC, a training-free hybrid compression framework that jointly models expert importance and token importance for co-optimized storage and inference efficiency. Methodologically, MC (1) couples expert-level importance estimation with token-level dynamic routing; (2) integrates pre-loading mixed-precision quantization, formulated as a linear programming problem, with online dynamic pruning; and (3) drives these components with LP-based adaptive bit-width allocation, importance scoring based on routing probabilities and activation frequency, and token-aware dynamic expert selection. MC compresses 76.6% of parameters to 2.54 bits per parameter while incurring only a 3.8% average accuracy drop. During inference, it further reduces activated parameters by 15% with negligible performance degradation (<0.6%). This is the first work to unify expert- and token-level importance modeling for efficient MoE-LLM deployment without fine-tuning.
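The token-aware dynamic expert selection described above can be sketched as follows. This is a minimal illustration, not the paper's exact method: the importance proxy (each token's top routing probability), the `important_frac` split, and the fallback to a single expert for unimportant tokens are all assumptions made here for concreteness.

```python
import numpy as np

def dynamic_expert_selection(router_probs, important_frac=0.5, full_k=2):
    """Token-aware dynamic expert pruning (illustrative heuristic).

    router_probs: (T, E) softmax routing probabilities per token.
    Tokens are ranked by a simple importance proxy (their maximum routing
    probability); the top `important_frac` fraction keeps the full
    top-`full_k` experts, while the rest fall back to a single expert.
    The paper's actual token-importance criterion may differ.
    """
    T, E = router_probs.shape
    # Importance proxy: confidence of the strongest expert per token.
    scores = router_probs.max(axis=1)
    n_important = max(1, int(important_frac * T))
    important = np.argsort(scores)[::-1][:n_important]
    # Default: every token routes only to its single best expert.
    k_per_token = np.ones(T, dtype=int)
    k_per_token[important] = full_k
    # Expert indices per token, sorted by descending routing probability.
    order = np.argsort(router_probs, axis=1)[:, ::-1]
    return [order[t, :k_per_token[t]].tolist() for t in range(T)]
```

Because unimportant tokens activate fewer experts, the average number of activated parameters per forward pass drops, which is the source of the ~15% reduction reported above.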
📝 Abstract
Mixture-of-Experts large language models (MoE-LLMs) mark a significant step forward for language models; however, they face two critical challenges in practice: 1) expert parameters lead to considerable memory consumption and loading latency; and 2) the currently activated experts are redundant, as many tokens may require only a single expert. Motivated by these issues, we investigate MoE-LLMs and make two key observations: a) different experts exhibit varying behaviors in activation reconstruction error, routing scores, and activation frequencies, highlighting their differing importance; and b) not all tokens are equally important -- only a small subset is critical. Building on these insights, we propose MC, a training-free Mixture-Compressor for MoE-LLMs that leverages the significance of both experts and tokens to achieve extreme compression. First, to mitigate storage and loading overheads, we introduce Pre-Loading Mixed-Precision Quantization, which formulates adaptive bit-width allocation as a Linear Programming problem whose objective balances multiple factors reflecting the importance of each expert. Second, we develop Online Dynamic Pruning, which identifies important tokens to retain and dynamically selects activated experts for the remaining tokens during inference, optimizing efficiency while maintaining performance. Our MC integrates static quantization and dynamic pruning to collaboratively achieve extreme compression of MoE-LLMs with minimal accuracy loss, ensuring an optimal trade-off between performance and efficiency. Extensive experiments confirm the effectiveness of our approach. For instance, at 2.54 bits, MC compresses 76.6% of the model with only a 3.8% average accuracy loss. During dynamic inference, we further reduce activated parameters by 15%, with a performance drop of less than 0.6%.
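To make the LP formulation of adaptive bit-width allocation concrete, here is a minimal sketch using a standard LP relaxation. The cost model (`importance / 2**bits` as a quantization-error proxy), the candidate bit-widths, and the argmax rounding are all assumptions for illustration; the paper's objective combines several expert-importance factors.

```python
import numpy as np
from scipy.optimize import linprog

def allocate_bits(importance, bit_choices, avg_bit_budget):
    """LP relaxation of per-expert bit-width allocation (sketch).

    Variables x[e, b] in [0, 1] indicate that expert e is quantized to
    bit_choices[b] bits. The objective penalizes giving few bits to
    important experts via the proxy cost importance[e] / 2**bits; the
    coupling constraint keeps the average bit-width within budget.
    """
    E, B = len(importance), len(bit_choices)
    # Objective: minimize importance-weighted quantization-error proxy.
    c = np.array([[imp / (2.0 ** b) for b in bit_choices]
                  for imp in importance]).ravel()
    # Equality: each expert selects exactly one bit-width.
    A_eq = np.zeros((E, E * B))
    for e in range(E):
        A_eq[e, e * B:(e + 1) * B] = 1.0
    b_eq = np.ones(E)
    # Inequality: total bits across experts stay within the budget.
    A_ub = np.array([bit_choices[i % B] for i in range(E * B)],
                    dtype=float).reshape(1, -1)
    b_ub = np.array([avg_bit_budget * E])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0.0, 1.0))
    x = res.x.reshape(E, B)
    # Round the relaxation: each expert takes its highest-weight choice.
    return [bit_choices[j] for j in x.argmax(axis=1)]
```

Because the per-expert marginal error savings decrease as bits increase, this LP behaves like a fractional knapsack and its optimum is (near-)integral, so the rounding step rarely violates the budget in practice.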