🤖 AI Summary
To address the deployment challenges of Mixture-of-Experts (MoE) models, which stem from their massive parameter counts and high computational overhead, this paper proposes MxMoE, a hardware-aware mixed-precision quantization framework tailored to MoE architectures. The method identifies and jointly models two key factors: (i) heterogeneous quantization sensitivity across linear layers, and (ii) skewed expert activation frequencies. Through algorithm-system co-design, it automatically searches for mixed-precision configurations and generates optimized multi-precision GroupGEMM kernels. Evaluated on Wikitext-2, the 2.25-bit quantized model achieves 2.4 lower perplexity than GPTQ; it also delivers a 3.4× speedup over full-precision inference and up to 29.4% higher throughput than uniform 5-bit quantization at comparable accuracy, substantially improving the trade-off between inference efficiency and model fidelity.
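The configuration search can be pictured with a small greedy sketch. The snippet below is illustrative only and is not the MxMoE implementation: all names (`allocate_bitwidths`, `sensitivity`, `activation_freq`) and the 0.5 decay heuristic are hypothetical; it merely conveys the idea of promoting the most frequency-weighted, quantization-sensitive blocks to higher precision under an average-bit budget.

```python
# Hypothetical sketch of sensitivity- and frequency-aware bit-width allocation.
# Not the authors' code: names, scoring, and the decay factor are assumptions.

def allocate_bitwidths(sensitivity, activation_freq,
                       candidate_bits=(2, 3, 4, 8), avg_bit_budget=2.25):
    """Greedy allocation: every linear block starts at the lowest bit-width;
    the block with the largest frequency-weighted sensitivity is repeatedly
    promoted to the next precision while the average stays within budget."""
    sensitivity = dict(sensitivity)  # avoid mutating the caller's dict
    bits = {name: min(candidate_bits) for name in sensitivity}

    def mean_bits(assign):
        return sum(assign.values()) / len(assign)

    while True:
        # Pick the block whose promotion would help most, weighting its
        # quantization sensitivity by how often its expert is activated.
        best_name, best_score = None, 0.0
        for name, b in bits.items():
            if b >= max(candidate_bits):
                continue
            score = sensitivity[name] * activation_freq[name]
            if score > best_score:
                best_name, best_score = name, score
        if best_name is None:
            break
        nxt = min(c for c in candidate_bits if c > bits[best_name])
        trial = dict(bits, **{best_name: nxt})
        if mean_bits(trial) > avg_bit_budget:
            break
        bits = trial
        sensitivity[best_name] *= 0.5  # diminishing benefit after a promotion
    return bits


# Toy usage: two experts, one frequently routed to ("hot"), one cold.
sens = {"expert0.w1": 1.0, "expert0.w2": 0.6, "expert1.w1": 0.9, "expert1.w2": 0.4}
freq = {"expert0.w1": 0.8, "expert0.w2": 0.8, "expert1.w1": 0.2, "expert1.w2": 0.2}
print(allocate_bitwidths(sens, freq))  # the hot, sensitive block gets more bits
```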
📝 Abstract
Mixture-of-Experts (MoE) models face deployment challenges due to their large parameter counts and computational demands. We explore quantization for MoE models and highlight two key insights: (1) linear blocks exhibit varying quantization sensitivity, and (2) divergent expert activation frequencies create heterogeneous computational characteristics. Based on these observations, we introduce MxMoE, a mixed-precision optimization framework for MoE models that considers both algorithmic and system perspectives. MxMoE navigates the design space defined by parameter sensitivity, expert activation dynamics, and hardware resources to derive efficient mixed-precision configurations. Additionally, MxMoE automatically generates optimized mixed-precision GroupGEMM kernels, enabling parallel execution of GEMMs at different precisions. Evaluations show that MxMoE outperforms existing methods, achieving 2.4 lower Wikitext-2 perplexity than GPTQ at 2.25 bits, up to 3.4× speedup over full precision, and up to 29.4% speedup over uniform quantization at equivalent accuracy with 5-bit weight-activation quantization. Our code is available at https://github.com/cat538/MxMoE.
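To make the kernel-level idea concrete, here is a rough Python mock of how per-token expert GEMMs could be bucketed by assigned precision so that each bucket runs as one grouped GEMM. MxMoE's actual contribution is a fused multi-precision GroupGEMM GPU kernel, which this sketch does not reproduce; the names (`dispatch_by_precision`, `expert_bits`) are invented for illustration, and the per-expert matmul loop stands in for the real grouped kernel.

```python
# Illustrative dispatch logic only (not the MxMoE kernel): group routed tokens
# by the bit-width assigned to their expert, then run each precision group as
# one grouped GEMM. Here each group is emulated with plain matmuls.
from collections import defaultdict

import torch


def dispatch_by_precision(token_hidden, expert_ids, expert_bits, expert_weights):
    """Bucket per-token expert GEMMs by expert precision; each bucket would be
    one grouped GEMM in a real multi-precision GroupGEMM kernel."""
    buckets = defaultdict(list)
    for tok, eid in enumerate(expert_ids.tolist()):
        buckets[expert_bits[eid]].append((tok, eid))

    out = torch.zeros(token_hidden.shape[0], expert_weights[0].shape[1],
                      dtype=token_hidden.dtype)
    for bits, pairs in buckets.items():
        # In the real system this inner loop is a single fused GroupGEMM for
        # all experts quantized to `bits`; here it is emulated token by token.
        for tok, eid in pairs:
            out[tok] = token_hidden[tok] @ expert_weights[eid]
    return out


# Toy usage: 6 tokens, 4 experts; experts 0-1 kept at 8-bit, experts 2-3 at 4-bit.
hidden = torch.randn(6, 16)
routing = torch.tensor([0, 2, 1, 3, 0, 2])
bits = {0: 8, 1: 8, 2: 4, 3: 4}
weights = [torch.randn(16, 32) for _ in range(4)]
print(dispatch_by_precision(hidden, routing, bits, weights).shape)  # (6, 32)
```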