MoQE: Improve Quantization Model performance via Mixture of Quantization Experts

📅 2025-08-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Quantization often incurs accuracy degradation, hindering efficient model deployment. To address this, we propose MoQE—the first framework to integrate the Mixture of Experts (MoE) paradigm into model quantization. MoQE treats multiple quantized variants of the same floating-point model as specialized “quantization experts” and introduces a lightweight, architecture-aware dynamic router that adaptively selects the optimal quantization strategy based on input features. The method is validated across diverse tasks: computer vision (ResNet on ImageNet) and natural language processing (LLaMA and Qwen on WikiText, C4, and OpenWebText), consistently achieving state-of-the-art quantization accuracy without significant inference latency overhead. Our core contribution lies in pioneering the application of MoE to quantization adaptation—enabling input-aware, fine-grained trade-offs between accuracy and efficiency.

📝 Abstract
Quantization plays a crucial role in improving model efficiency and reducing deployment costs, enabling the widespread application of deep learning models on resource-constrained devices. However, the quantization process inevitably introduces accuracy degradation. In this paper, we propose Mixture of Quantization Experts (MoQE), a quantization inference framework based on the Mixture-of-Experts (MoE) architecture, aiming to jointly improve the performance of quantized models. MoQE combines multiple quantized variants of one full-precision model as specialized "quantization experts" and dynamically routes input data to the most suitable expert based on its characteristics. Through these specialized quantization experts, MoQE alleviates the performance degradation commonly seen in single quantized models. We design lightweight, structure-aware router models tailored to both CV and NLP tasks. Experimental evaluations on the ResNet, LLaMA, and Qwen model families across benchmark datasets including ImageNet, WikiText, C4, and OpenWebText demonstrate that MoQE achieves performance comparable to state-of-the-art quantized models without incurring a significant increase in inference latency.
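The mechanism described in the abstract can be sketched in miniature: several quantized variants of one full-precision model serve as "experts", and a lightweight router inspects each input and dispatches it to one of them. The class names, the uniform quantizer, and the hand-written routing rule below are illustrative assumptions, not the paper's actual router or quantization scheme.

```python
# Toy sketch of the MoQE idea. All names and the routing heuristic are
# hypothetical; the paper learns a structure-aware router instead.

def quantize(weights, bits):
    """Uniform symmetric quantization of a weight list to a given bit-width."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax or 1.0
    return [round(w / scale) * scale for w in weights]

class QuantExpert:
    """One quantized variant of the same full-precision model."""
    def __init__(self, fp_weights, bits):
        self.bits = bits
        self.weights = quantize(fp_weights, bits)

    def forward(self, x):
        # A dot product stands in for a full quantized model's forward pass.
        return sum(w * xi for w, xi in zip(self.weights, x))

class MoQE:
    """Routes each input to the most suitable quantization expert."""
    def __init__(self, fp_weights, bit_widths=(4, 8)):
        self.experts = [QuantExpert(fp_weights, b) for b in bit_widths]

    def route(self, x):
        # Placeholder rule: inputs with a large dynamic range go to the
        # highest-precision expert; everything else uses the cheapest one.
        spread = max(x) - min(x)
        return self.experts[-1] if spread > 1.0 else self.experts[0]

    def forward(self, x):
        return self.route(x).forward(x)

moqe = MoQE([0.5, -0.25, 1.0])
y = moqe.forward([0.1, 0.2, 0.1])   # served by the 4-bit expert
```

Since only one expert runs per input, the per-inference cost stays close to that of a single quantized model plus the (small) router, which is consistent with the paper's claim of no significant latency overhead.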
Problem

Research questions and friction points this paper is trying to address.

Reduces accuracy loss in quantized deep learning models
Dynamically selects optimal quantization experts per input
Improves efficiency without increasing inference latency significantly
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture of Quantization Experts framework
Dynamic routing to specialized experts
Lightweight structure-aware router models
Jinhao Zhang
Harbin Institute of Technology, Shenzhen
Autonomous Driving, Embodied AI, Generative Model
Yunquan Zhang
Professor of Institute of Computing Technology, CAS
parallel computing, parallel programming, parallel computational model
Boyang Zhang
University of Chinese Academy of Sciences, Beijing, China; Peng Cheng Laboratory, Shenzhen, China
Zeyu Liu
North University of China, Taiyuan, China
Daning Cheng
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China