MoQa: Rethinking MoE Quantization with Multi-stage Data-model Distribution Awareness

📅 2025-03-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing quantization methods designed for dense LLMs are ill-suited for Mixture-of-Experts (MoE) models due to their sparse activation patterns, dynamic data-parameter mappings, and strong inter-expert correlations. This work proposes the first MoE-specific, multi-stage, distribution-aware quantization framework. It introduces a novel distribution-decoupling analytical paradigm, integrating expert importance scoring, hierarchical sensitivity modeling, scenario-adaptive mixed-precision assignment, and sparse-activation-aware calibration—enabling fine-grained, adaptive mixed-precision quantization. Theoretically, we characterize the impact of each quantization stage on overall performance, yielding new insights into MoE quantization behavior. Empirically, our method reduces perplexity by 1.69–2.18 on language modeling benchmarks and improves zero-shot accuracy by 1.58%–8.91%, significantly outperforming prior MoE quantization approaches.
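The summary's "expert importance scoring" and "mixed-precision assignment" can be illustrated with a toy sketch (not MoQa's actual algorithm, whose importance metric and assignment policy are more elaborate): rank experts by how often the router activates them on calibration data, then give frequently used experts more bits and rarely used experts fewer. All names, fractions, and bit options below are hypothetical.

```python
import numpy as np

def assign_expert_bitwidths(activation_counts, bit_options=(2, 4, 8),
                            high_frac=0.25, low_frac=0.25):
    """Toy mixed-precision assignment: the most-activated experts get the
    highest precision, the least-activated get the lowest, and the rest get
    the middle option. Illustrative only."""
    counts = np.asarray(activation_counts, dtype=float)
    n = len(counts)
    order = np.argsort(-counts)            # experts, most- to least-activated
    bits = np.full(n, bit_options[1])      # default: middle precision
    n_hi = int(np.ceil(n * high_frac))
    n_lo = int(np.ceil(n * low_frac))
    bits[order[:n_hi]] = bit_options[2]    # hot experts -> 8-bit
    bits[order[n - n_lo:]] = bit_options[0]  # cold experts -> 2-bit
    return bits

# 8 experts with skewed router activation counts from a calibration set
print(assign_expert_bitwidths([900, 850, 120, 90, 60, 40, 10, 5]))
# -> [8 8 4 4 4 4 2 2]
```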

📝 Abstract
With the advances in artificial intelligence, Mixture-of-Experts (MoE) has become the dominant form of Large Language Models (LLMs), and its demand for model compression is increasing. Quantization is an effective method that not only compresses models but also significantly accelerates inference. Existing quantization methods have gradually shifted their focus from parameter scaling to the analysis of data distributions. However, their analysis is designed for dense LLMs and relies on a simple one-model-all-data mapping, which is unsuitable for MoEs. This paper proposes a new quantization framework called MoQa. MoQa decouples the data-model distribution complexity of MoEs across multiple analysis stages, quantitatively revealing the dynamics of sparse data activation, data-parameter mapping, and inter-expert correlations. Based on these analyses, MoQa identifies the significance of particular experts and parameters with optimal data-model distribution awareness and proposes a series of fine-grained mixed-precision quantization strategies adaptive to various data activation and expert combination scenarios. Moreover, MoQa discusses the limitations of existing quantization methods and analyzes the impact of each analysis stage, offering novel insights into MoE quantization. Experiments show that MoQa achieves a 1.69~2.18 perplexity decrease on language modeling tasks and a 1.58%~8.91% accuracy improvement on zero-shot inference tasks. We believe MoQa will play a role in future MoE construction, optimization, and compression.
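As background for the abstract's claim that quantization compresses models at the cost of precision, a minimal symmetric uniform quantizer shows how reconstruction error grows as bit-width shrinks. This is a generic per-tensor sketch, not the paper's method; MoE-aware schemes like MoQa use finer granularity and data-aware calibration.

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Minimal symmetric uniform quantizer: scale weights into a signed
    `bits`-bit integer grid, round, then dequantize back to floats."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1024)          # stand-in for one weight tensor
for b in (8, 4, 2):
    mse = np.mean((w - quantize_symmetric(w, b)) ** 2)
    print(f"{b}-bit reconstruction MSE: {mse:.6f}")
```

Running this shows the expected trend: error is tiny at 8 bits and grows sharply at 2 bits, which is why assigning bit-widths per expert (rather than uniformly) matters for sparse MoE models.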
Problem

Research questions and friction points this paper is trying to address.

Quantizing MoE models with multi-stage data-model distribution awareness
Addressing limitations of existing dense LLM quantization for sparse MoEs
Optimizing mixed-precision quantization strategies for varied data activation patterns
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-stage analysis that decouples sparse activation, data-parameter mapping, and inter-expert correlations
Expert- and parameter-level significance identification with optimal data-model distribution awareness
Fine-grained mixed-precision quantization strategies adaptive to activation and expert-combination scenarios
👥 Authors
Zihao Zheng (Peking University)
Xiuping Cui (Peking University)
Size Zheng (ByteDance Seed): Architecture, Compiler, Deep Learning
Maoliang Li (Peking University)
Jiayu Chen (Peking University)
Yun (Eric) Liang (Professor of EECS, Peking University; ACM Distinguished Scientist): electronic design automation, hardware and software co-design, computer architecture, and embedded systems
Xiang Chen (Peking University)