🤖 AI Summary
Quantization often incurs accuracy degradation, hindering efficient model deployment. To address this, we propose MoQE—the first framework to integrate the Mixture of Experts (MoE) paradigm into model quantization. MoQE treats multiple quantized variants of the same floating-point model as specialized “quantization experts” and introduces a lightweight, architecture-aware dynamic router that adaptively selects the optimal quantization strategy based on input features. The method is validated across diverse tasks: computer vision (ResNet on ImageNet) and natural language processing (LLaMA and Qwen on WikiText, C4, and OpenWebText), consistently achieving state-of-the-art quantization accuracy without significant inference latency overhead. Our core contribution lies in pioneering the application of MoE to quantization adaptation—enabling input-aware, fine-grained trade-offs between accuracy and efficiency.
📝 Abstract
Quantization plays a crucial role in improving model efficiency and reducing deployment costs, enabling the widespread application of deep learning models on resource-constrained devices. However, the quantization process inevitably introduces accuracy degradation. In this paper, we propose Mixture of Quantization Experts (MoQE), a quantization inference framework based on the Mixture-of-Experts (MoE) architecture that aims to improve the performance of quantized models. MoQE combines multiple quantized variants of a single full-precision model as specialized "quantization experts" and dynamically routes each input to the most suitable expert based on its characteristics. Through these specialized quantization experts, MoQE alleviates the performance degradation commonly seen in single quantized models. We design lightweight, structure-aware router models tailored to both CV and NLP tasks. Experimental evaluations on the ResNet, LLaMA, and Qwen model families across benchmark datasets including ImageNet, WikiText, C4, and OpenWebText demonstrate that MoQE achieves performance comparable to state-of-the-art quantized models without incurring significant increases in inference latency.
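To make the inference flow concrete, here is a minimal toy sketch of the MoQE idea: several quantized variants of the same weights act as experts, and a router picks one per input. All names and the magnitude-based routing heuristic are hypothetical illustrations; the paper's actual router is a learned, structure-aware model, not a hand-written rule.

```python
# Toy sketch of MoQE-style routed quantized inference (hypothetical names;
# the real router in the paper is a learned network, not this heuristic).

def quantize(weights, bits):
    """Uniform symmetric quantization of a weight list to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) * scale for w in weights]

class QuantExpert:
    """One 'quantization expert': the same model at a fixed bit-width."""
    def __init__(self, weights, bits):
        self.bits = bits
        self.weights = quantize(weights, bits)

    def forward(self, x):
        # Stand-in for a real forward pass: a simple dot product.
        return sum(w * xi for w, xi in zip(self.weights, x))

def route(x, experts):
    """Toy router: send 'hard' inputs (large mean magnitude) to the
    higher-precision expert, easy inputs to the cheaper one."""
    hardness = sum(abs(v) for v in x) / len(x)
    return experts[-1] if hardness > 1.0 else experts[0]

weights = [0.12, -0.53, 0.98, -0.07]
experts = [QuantExpert(weights, bits=4), QuantExpert(weights, bits=8)]

easy = [0.1, 0.2, 0.1, 0.3]
hard = [2.0, -1.5, 3.0, 0.5]
print(route(easy, experts).bits)  # -> 4 (low-bit expert)
print(route(hard, experts).bits)  # -> 8 (high-bit expert)
```

The key point the sketch illustrates is that routing happens per input at inference time, so the accuracy/efficiency trade-off is input-aware rather than fixed once at quantization time.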