Dynamic Expert Quantization for Scalable Mixture-of-Experts Inference

📅 2025-11-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deploying Mixture-of-Experts (MoE) models on consumer-grade GPUs is hampered by the memory overhead of inactive experts, and static quantization cannot adapt to dynamic expert activation patterns, causing accuracy degradation under compression. This paper proposes DynaExq, a dynamic quantization system whose core idea is to model expert precision as a runtime-schedulable resource. Three components enable this: a hotness-aware precision controller, an asynchronous precision-switching pipeline, and a fragmentation-free memory pool, which together support fine-grained, low-overhead coexistence of mixed-precision experts. DynaExq is compatible with large-scale MoE models such as Qwen3 and deploys 30B–80B MoE models on a single RTX 5090 or A6000 GPU. Compared to static quantization, it improves accuracy by up to 4.03 points, enhancing inference efficiency and stability under memory-constrained conditions.

📝 Abstract
Mixture-of-Experts (MoE) models scale LLM capacity efficiently, but deployment on consumer GPUs is limited by the large memory footprint of inactive experts. Static post-training quantization reduces storage costs but cannot adapt to shifting activation patterns, causing accuracy loss under aggressive compression. We therefore present DynaExq, a runtime system that treats expert precision as a first-class, dynamically managed resource. DynaExq combines (1) a hotness-aware precision controller that continuously aligns expert bit-widths with long-term activation statistics, (2) a fully asynchronous precision-switching pipeline that overlaps promotion and demotion with MoE computation, and (3) a fragmentation-free memory pooling mechanism that supports hybrid-precision experts with deterministic allocation. Together, these components enable stable, non-blocking precision transitions under strict HBM budgets. Across Qwen3-30B and Qwen3-80B MoE models and six representative benchmarks, DynaExq deploys large LLMs on single RTX 5090 and A6000 GPUs and improves accuracy by up to 4.03 points over static low-precision baselines. The results show that adaptive, workload-aware quantization is an effective strategy for memory-constrained MoE serving.
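The asynchronous precision-switching pipeline described in the abstract can be sketched roughly as follows. This is an illustrative sketch, not the paper's implementation: `requantize`, `AsyncSwitcher`, and the weight placeholders are hypothetical names. The idea shown is only that weight re-quantization runs on a background thread and finished transitions are swapped in at layer boundaries, so the forward pass never blocks on a precision change.

```python
import concurrent.futures
import time

def requantize(expert_id, target_bits):
    """Stand-in for the actual expert weight conversion work."""
    time.sleep(0.01)
    return expert_id, f"weights@{target_bits}bit"

class AsyncSwitcher:
    def __init__(self):
        self.pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        self.pending = {}   # expert_id -> in-flight Future
        self.weights = {}   # expert_id -> currently active weights

    def request(self, expert_id, target_bits):
        # Allow at most one in-flight transition per expert.
        if expert_id not in self.pending:
            self.pending[expert_id] = self.pool.submit(
                requantize, expert_id, target_bits)

    def commit_ready(self):
        """Called at a layer boundary: swap in any finished transitions
        without ever blocking on an unfinished one."""
        done = [e for e, f in self.pending.items() if f.done()]
        for e in done:
            _, w = self.pending.pop(e).result()
            self.weights[e] = w
        return done
```

The key property is that `commit_ready` only polls (`Future.done()`) rather than waits, mirroring the paper's claim of non-blocking precision transitions overlapped with MoE computation.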
Problem

Research questions and friction points this paper is trying to address.

Reducing memory footprint of inactive experts in MoE models for consumer GPU deployment
Adapting quantization precision to shifting activation patterns to prevent accuracy loss
Managing expert precision dynamically under strict memory budgets during inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic expert precision management for MoE models
Hotness-aware bit-width controller using activation statistics
Asynchronous precision switching with fragmentation-free memory pooling
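The hotness-aware bit-width controller listed above could look roughly like the following sketch. All names and thresholds here are hypothetical (the paper does not publish this code): experts whose long-term activation share crosses a promotion threshold are raised to a higher bit-width, cold experts are demoted, and a cap on high-precision experts models the strict HBM budget.

```python
from collections import defaultdict

# Illustrative thresholds, not taken from the paper.
PROMOTE_THRESHOLD = 0.05   # activation share above which an expert is "hot"
DEMOTE_THRESHOLD = 0.01    # share below which an expert is "cold"

class PrecisionController:
    def __init__(self, num_experts, high_bits=8, low_bits=4, max_high=8):
        self.counts = defaultdict(int)   # long-term activation counts
        self.total = 0
        self.bits = {e: low_bits for e in range(num_experts)}
        self.high_bits, self.low_bits = high_bits, low_bits
        self.max_high = max_high         # HBM budget: cap on high-precision experts

    def record(self, routed_experts):
        """Update activation statistics after each routing step."""
        for e in routed_experts:
            self.counts[e] += 1
            self.total += 1

    def plan_transitions(self):
        """Return (promotions, demotions) that keep the number of
        high-precision experts within the memory budget."""
        if self.total == 0:
            return [], []
        share = {e: c / self.total for e, c in self.counts.items()}
        promotions = [e for e, b in self.bits.items()
                      if b == self.low_bits and share.get(e, 0.0) >= PROMOTE_THRESHOLD]
        demotions = [e for e, b in self.bits.items()
                     if b == self.high_bits and share.get(e, 0.0) < DEMOTE_THRESHOLD]
        # Demotions free budget first; then promote the hottest experts.
        n_high = sum(1 for b in self.bits.values()
                     if b == self.high_bits) - len(demotions)
        promotions.sort(key=lambda e: share.get(e, 0.0), reverse=True)
        promotions = promotions[: max(0, self.max_high - n_high)]
        for e in demotions:
            self.bits[e] = self.low_bits
        for e in promotions:
            self.bits[e] = self.high_bits
        return promotions, demotions
```

In the full system these planned transitions would be handed to the asynchronous switching pipeline rather than applied synchronously; the two thresholds provide hysteresis so experts do not oscillate between precisions.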
Kexin Chu
Ph.D Student of Computer Science, University of Connecticut
LLM inference acceleration, security
Dawei Xiang
University of Connecticut
computer vision, artificial intelligence, biomedical informatics, deep learning
Zixu Shen
University of Connecticut, Storrs, CT, USA
Yiwei Yang
University of California, Santa Cruz, Santa Cruz, CA, USA
Zecheng Lin
Independent Researcher, USA
Wei Zhang
University of Connecticut, Storrs, CT, USA