🤖 AI Summary
Quantization often incurs accuracy degradation, hindering efficient model deployment. To address this, we propose MoQE—the first framework to integrate the Mixture of Experts (MoE) paradigm into model quantization. MoQE treats multiple quantized variants of the same floating-point model as specialized “quantization experts” and introduces a lightweight, architecture-aware dynamic router that adaptively selects the optimal quantization strategy based on input features. The method is validated across diverse tasks: computer vision (ResNet on ImageNet) and natural language processing (LLaMA and Qwen on WikiText, C4, and OpenWebText), consistently achieving state-of-the-art quantization accuracy without significant inference latency overhead. Our core contribution lies in pioneering the application of MoE to quantization adaptation—enabling input-aware, fine-grained trade-offs between accuracy and efficiency.
📝 Abstract
Quantization plays a crucial role in improving model efficiency and reducing deployment costs, enabling the widespread application of deep learning models on resource-constrained devices. However, the quantization process inevitably introduces accuracy degradation. In this paper, we propose Mixture of Quantization Experts (MoQE), a quantization inference framework based on the Mixture-of-Experts (MoE) architecture that aims to improve the performance of quantized models. MoQE combines multiple quantized variants of a single full-precision model as specialized "quantization experts" and dynamically routes each input to the most suitable expert based on its characteristics. Through these specialized quantization experts, MoQE alleviates the performance degradation commonly seen in single quantized models. We design lightweight, structure-aware router models tailored to both CV and NLP tasks. Experimental evaluations on the ResNet, LLaMA, and Qwen model families across benchmark datasets including ImageNet, WikiText, C4, and OpenWebText demonstrate that MoQE achieves performance comparable to state-of-the-art quantized models without incurring significant increases in inference latency.
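To make the inference flow concrete, here is a minimal toy sketch of the MoQE idea: several quantized variants of the same weights act as experts, and a router picks one per input. All names and the magnitude-based routing heuristic are hypothetical illustrations; the paper's actual router is a learned, structure-aware model, not a hand-written rule.

```python
# Toy sketch of MoQE-style routed quantized inference (hypothetical names;
# the real router in the paper is a learned network, not this heuristic).

def quantize(weights, bits):
    """Uniform symmetric quantization of a weight list to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) * scale for w in weights]

class QuantExpert:
    """One 'quantization expert': the same model at a fixed bit-width."""
    def __init__(self, weights, bits):
        self.bits = bits
        self.weights = quantize(weights, bits)

    def forward(self, x):
        # Stand-in for a real forward pass: a simple dot product.
        return sum(w * xi for w, xi in zip(self.weights, x))

def route(x, experts):
    """Toy router: send 'hard' inputs (large mean magnitude) to the
    higher-precision expert, easy inputs to the cheaper one."""
    hardness = sum(abs(v) for v in x) / len(x)
    return experts[-1] if hardness > 1.0 else experts[0]

weights = [0.12, -0.53, 0.98, -0.07]
experts = [QuantExpert(weights, bits=4), QuantExpert(weights, bits=8)]

easy = [0.1, 0.2, 0.1, 0.3]
hard = [2.0, -1.5, 3.0, 0.5]
print(route(easy, experts).bits)  # -> 4 (low-bit expert)
print(route(hard, experts).bits)  # -> 8 (high-bit expert)
```

The key point the sketch illustrates is that routing happens per input at inference time, so the accuracy/efficiency trade-off is input-aware rather than fixed once at quantization time.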