Efficient Quantization of Mixture-of-Experts with Theoretical Generalization Guarantees

πŸ“… 2026-04-07
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the performance degradation and memory overhead in low-bit quantization of sparse Mixture-of-Experts (MoE) models caused by uniform precision allocation. The authors propose a novel mixed-precision quantization method tailored to expert-specific characteristics. They theoretically establish, for the first time, a connection between an expert’s ability to capture salient features and its sensitivity to quantization. Leveraging the L2 norm dynamics of router weights and the maximum neuron-wise variance during training, the method dynamically assigns bit-widths without incurring additional computational cost. Experiments on large-scale MoE architectures such as Switch Transformer and Mixtral demonstrate that the approach significantly outperforms existing techniques, achieving higher accuracy at substantially reduced inference costs while introducing negligible overhead for bit-width allocation.
πŸ“ Abstract
Sparse Mixture-of-Experts (MoE) allows language and vision models to scale efficiently by activating only a small subset of experts per input. While this reduces computation, the large number of parameters still incurs substantial memory overhead during inference. Post-training quantization has been explored to address this issue. Because uniform quantization suffers from significant accuracy loss at low bit-widths, mixed-precision methods have recently been explored; however, they often require substantial computation for bit-width allocation and overlook the varying sensitivity of model performance to the quantization of different experts. We propose a theoretically grounded expert-wise mixed-precision strategy that assigns a bit-width to each expert primarily based on the change in its router weights' L2 norm during training. Experts with smaller changes are shown to capture less frequent but critical features, and model performance is more sensitive to the quantization of these experts, thus requiring higher precision. Furthermore, to avoid assigning low precision to experts that would inject high quantization noise, experts with large maximum intra-neuron variance are also allocated higher precision. Experiments on large-scale MoE models, including Switch Transformer and Mixtral, show that our method achieves higher accuracy than existing approaches, while also reducing inference cost and incurring only negligible overhead for bit-width assignment.
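The allocation rule described in the abstract can be sketched as follows. This is a minimal illustrative implementation, not the paper's actual algorithm: the function name, the two-level (high/low) bit scheme, and the top-k fraction used for selection are all assumptions made for the example. The two inputs correspond to the paper's signals: the per-expert change in router weight L2 norm over training, and the per-expert maximum intra-neuron weight variance.

```python
import numpy as np

def assign_bit_widths(router_norm_changes, max_neuron_vars,
                      high_bits=8, low_bits=4, high_frac=0.25):
    """Hypothetical expert-wise bit-width assignment (illustrative only).

    Experts whose router weight L2 norm changed least during training
    (per the paper, these capture rare but critical features and are
    quantization-sensitive) receive the higher bit-width. Experts with
    large maximum intra-neuron variance (which would inject high
    quantization noise at low precision) also receive the higher
    bit-width. All remaining experts get the lower bit-width. The
    high/low split and the top-k fraction are assumed, not from the paper.
    """
    router_norm_changes = np.asarray(router_norm_changes, dtype=float)
    max_neuron_vars = np.asarray(max_neuron_vars, dtype=float)
    n = len(router_norm_changes)
    k = max(1, int(high_frac * n))

    bits = np.full(n, low_bits)
    # Smallest router-norm change -> most sensitive -> higher precision.
    bits[np.argsort(router_norm_changes)[:k]] = high_bits
    # Largest intra-neuron variance -> noisy at low bits -> higher precision.
    bits[np.argsort(-max_neuron_vars)[:k]] = high_bits
    return bits
```

Because the signals are already collected during training, the assignment itself is a single sort per criterion, which matches the abstract's claim of negligible bit-width allocation overhead.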
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
Post-training quantization
Mixed-precision
Quantization sensitivity
Memory overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
Mixed-precision quantization
Theoretical generalization
Expert-wise allocation
Post-training quantization