GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high memory overhead of Mixture-of-Experts (MoE) large language models, which stems from their massive expert parameters. Existing mixed-precision quantization methods estimate expert importance layer by layer and neglect the router shifts induced by quantization, which leads to suboptimal bit-width allocation and routing. The authors propose GEMQ, a global expert-level mixed-precision quantization framework with two components: a global linear-programming formulation that captures model-wide expert importance based on quantization error analysis, and an efficient router fine-tuning step that adapts routing to the quantized experts. Both are integrated into a progressive quantization pipeline that iteratively refines importance estimation and bit-width allocation rather than optimizing each layer in isolation. According to the reported experiments, this approach significantly reduces memory consumption and accelerates inference with minimal accuracy degradation.
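The router shift mentioned above can be illustrated with a toy example. The sketch below is not taken from the GEMQ code base: the router weights, the hidden states, and the helper names `router_logits` and `top_k_experts` are all hypothetical. It only shows how a small perturbation of a token's hidden state, such as the error introduced by quantized experts in earlier layers, can change which experts a top-k router selects.

```python
from typing import List


def router_logits(weights: List[List[float]], x: List[float]) -> List[float]:
    """Linear router: one logit per expert (one weight row per expert)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def top_k_experts(logits: List[float], k: int) -> List[int]:
    """Indices of the k largest logits, highest first."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]


# Hypothetical router with 4 experts and a 2-dimensional hidden state.
router = [
    [1.0, 0.0],
    [0.9, 0.2],
    [0.0, 1.0],
    [-1.0, 0.5],
]

# Assumed values: the same token before and after upstream quantization error.
x_full_precision = [1.0, 0.8]
x_quantized = [0.85, 0.95]

experts_fp = top_k_experts(router_logits(router, x_full_precision), k=2)
experts_q = top_k_experts(router_logits(router, x_quantized), k=2)

print(experts_fp)  # [1, 0]
print(experts_q)   # [1, 2]
```

In this toy setting expert 0 is replaced by expert 2 once the input is perturbed, even though the router weights themselves are unchanged. Adapting the router to such shifts is the role the summary assigns to router fine-tuning.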

📝 Abstract

Mixture-of-Experts Large Language Models (MoE-LLMs) achieve strong performance but incur substantial memory overhead due to massive expert parameters. Mixed-precision quantization mitigates this cost by allocating expert-wise bit-widths based on their importance, approaching the accuracy-memory Pareto frontier and enabling extreme low-bit quantization. However, existing methods rely on layer-wise importance estimation and overlook router shifts induced by quantization, resulting in suboptimal allocation and routing. In this work, we propose Global Expert-level Mixed-precision Quantization (GEMQ) to overcome these limitations via (1) a global linear-programming formulation that captures model-wide expert importance based on quantization error analysis, and (2) efficient router fine-tuning to adapt routing to quantized experts. These components are integrated into a progressive quantization framework that iteratively refines importance estimation and allocation. Experiments demonstrate that GEMQ significantly reduces memory and accelerates inference with minimal accuracy degradation. Source code is available at https://github.com/jndeng/GEMQ .
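To make the difference between layer-wise and global allocation concrete, here is a minimal sketch under several stated assumptions. The abstract describes a linear-programming formulation; the code below does not reproduce it and instead swaps in a small exact dynamic program over a discrete bit budget. The error scores, the function name `allocate_bits`, and the assumption that all experts have the same parameter count are hypothetical and chosen for illustration only.

```python
from typing import Dict, List


def allocate_bits(errors: List[Dict[int, float]], avg_bits: float) -> List[int]:
    """Choose one bit-width per expert to minimize total error under a budget.

    errors[i][b] is an assumed error score for expert i quantized to b bits.
    Experts are assumed to be equally sized, so the budget is a cap on the
    sum of the chosen bit-widths.
    """
    budget = int(avg_bits * len(errors))
    # best[used] = (lowest total error, chosen bit-widths) using `used` bits
    best = {0: (0.0, [])}
    for options in errors:
        nxt = {}
        for used, (err, picks) in best.items():
            for bits, e in options.items():
                total = used + bits
                if total > budget:
                    continue
                cand = (err + e, picks + [bits])
                if total not in nxt or cand[0] < nxt[total][0]:
                    nxt[total] = cand
        best = nxt
    if not best:
        raise ValueError("budget too small for any allocation")
    return min(best.values(), key=lambda t: t[0])[1]


# Hypothetical scores: layer A holds two sensitive experts, layer B two robust ones.
layer_a = [{1: 9.0, 2: 4.0, 3: 1.0}, {1: 8.0, 2: 3.0, 3: 0.9}]
layer_b = [{1: 1.0, 2: 0.5, 3: 0.2}, {1: 0.8, 2: 0.4, 3: 0.1}]

# Layer-wise: each layer spends its own 2-bit average independently.
layerwise = allocate_bits(layer_a, 2.0) + allocate_bits(layer_b, 2.0)

# Global: the same 2-bit average is shared across all experts in the model.
global_alloc = allocate_bits(layer_a + layer_b, 2.0)

print(layerwise)     # [2, 2, 2, 2]
print(global_alloc)  # [3, 3, 1, 1]
```

With these made-up scores, both allocations use the same total budget, but the global one moves bits from the robust layer to the sensitive layer and lowers the summed error from 7.9 to 3.7. A layer-wise scheme cannot make that trade because each layer's budget is fixed in advance.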

Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts

mixed-precision quantization

router shift

expert importance

quantization error

Innovation

Methods, ideas, or system contributions that make the work stand out.

mixed-precision quantization

Mixture-of-Experts

global importance estimation