🤖 AI Summary
Deploying Mixture-of-Experts (MoE) models on consumer-grade GPUs is hampered by the memory overhead of inactive experts and by accuracy degradation, since static quantization cannot adapt to shifting expert activation patterns. This paper proposes DynaExq, a dynamic quantization system whose core idea is to treat expert precision as a runtime-schedulable resource. A hotness-aware precision controller, an asynchronous precision-switching pipeline, and a fragmentation-free memory pool together let experts at different precisions coexist with fine-grained, low-overhead switching. DynaExq supports large-scale MoE models such as Qwen3, deploying 30B–80B MoE models on a single RTX 5090 or A6000 GPU, and improves accuracy by up to 4.03 points over static quantization, yielding efficient and stable inference under tight memory budgets.
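The summary above describes precision as something the runtime schedules based on how "hot" each expert is. Below is a minimal sketch, not the authors' implementation, of what such a hotness-aware controller could look like: expert hotness is tracked as an exponential moving average of activation counts, and a fixed budget of high-precision slots is given to the hottest experts. The decay rate, slot budget, and bit-width labels are illustrative assumptions.

```python
from collections import defaultdict

class HotnessController:
    """Toy hotness-aware precision controller (illustrative only)."""

    def __init__(self, num_experts, hi_slots, decay=0.9):
        self.num_experts = num_experts
        self.hi_slots = hi_slots                  # how many experts may hold high precision
        self.decay = decay
        self.hotness = defaultdict(float)         # EMA of per-expert activation counts
        self.precision = {e: "int4" for e in range(num_experts)}  # start fully quantized

    def observe(self, activated_experts):
        """Update hotness statistics after one MoE routing step."""
        counts = defaultdict(int)
        for e in activated_experts:
            counts[e] += 1
        for e in range(self.num_experts):
            self.hotness[e] = self.decay * self.hotness[e] + (1 - self.decay) * counts[e]

    def plan_transitions(self):
        """Return (promote, demote) sets so only the hottest experts hold FP16."""
        ranked = sorted(range(self.num_experts), key=lambda e: self.hotness[e], reverse=True)
        hot = set(ranked[: self.hi_slots])
        promote = {e for e in hot if self.precision[e] != "fp16"}
        demote = {e for e, p in self.precision.items() if p == "fp16" and e not in hot}
        for e in promote:
            self.precision[e] = "fp16"
        for e in demote:
            self.precision[e] = "int4"
        return promote, demote

# Example: after a batch routes tokens to experts 3, 3, 7, 1, the controller
# promotes the hottest experts up to its two-slot budget; demotions occur
# once previously hot experts cool down in later steps.
ctrl = HotnessController(num_experts=8, hi_slots=2)
ctrl.observe([3, 3, 7, 1])
print(ctrl.plan_transitions())
```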
📝 Abstract
Mixture-of-Experts (MoE) models scale LLM capacity efficiently, but deployment on consumer GPUs is limited by the large memory footprint of inactive experts. Static post-training quantization reduces storage costs but cannot adapt to shifting activation patterns, causing accuracy loss under aggressive compression. We therefore present DynaExq, a runtime system that treats expert precision as a first-class, dynamically managed resource. DynaExq combines (1) a hotness-aware precision controller that continuously aligns expert bit-widths with long-term activation statistics, (2) a fully asynchronous precision-switching pipeline that overlaps promotion and demotion with MoE computation, and (3) a fragmentation-free memory pooling mechanism that supports hybrid-precision experts with deterministic allocation. Together, these components enable stable, non-blocking precision transitions under strict HBM budgets. Across Qwen3-30B and Qwen3-80B MoE models and six representative benchmarks, DynaExq deploys large LLMs on single RTX 5090 and A6000 GPUs and improves accuracy by up to 4.03 points over static low-precision baselines. The results show that adaptive, workload-aware quantization is an effective strategy for memory-constrained MoE serving.
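The abstract's second component, an asynchronous precision-switching pipeline, overlaps promotion and demotion with MoE computation. The PyTorch sketch below illustrates that overlap pattern only: the dequantization placeholder, tensor names, and scale are assumptions rather than DynaExq's actual kernels, and a CUDA-capable device is required.

```python
import torch

def dequantize_int4_to_fp16(packed, scale):
    # Placeholder "promotion": treat `packed` as already-unpacked int8 values;
    # a real system would unpack int4 groups and apply per-group scales.
    return packed.half() * scale

def forward_active_experts(x, active_weights):
    # Stand-in for the MoE forward pass over currently active experts.
    return [x @ w for w in active_weights]

device = "cuda"
copy_stream = torch.cuda.Stream()               # side stream for precision switches
x = torch.randn(16, 1024, device=device, dtype=torch.float16)
active = [torch.randn(1024, 1024, device=device, dtype=torch.float16)]
cold_packed = torch.randint(-8, 8, (1024, 1024), device=device, dtype=torch.int8)

with torch.cuda.stream(copy_stream):            # promotion runs off the critical path
    promoted = dequantize_int4_to_fp16(cold_packed, scale=0.01)

outputs = forward_active_experts(x, active)     # main stream keeps computing meanwhile

torch.cuda.current_stream().wait_stream(copy_stream)  # sync before using promoted weights
outputs.append(x @ promoted)
```

The key design point mirrored here is that the main stream never blocks on the precision switch; it only waits at the moment the newly promoted weights are actually needed.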