🤖 AI Summary
Deploying Mixture-of-Experts (MoE) models on consumer-grade GPUs is hampered by the memory overhead of inactive experts and by accuracy degradation, since static quantization cannot adapt to shifting expert activation patterns. This paper proposes DynaExq, a dynamic quantization system whose core idea is to treat expert precision as a runtime-schedulable resource. A hotness-aware precision controller, an asynchronous precision-switching pipeline, and a fragmentation-free memory pool together let experts at different precisions coexist with fine-grained, low-overhead switching. DynaExq supports large-scale MoE models such as Qwen3, deploying 30B–80B MoE models on a single RTX 5090 or A6000 GPU, and improves accuracy by up to 4.03 points over static quantization, yielding efficient and stable inference under tight memory budgets.
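The summary above describes precision as something the runtime schedules based on how "hot" each expert is. Below is a minimal sketch, not the authors' implementation, of what such a hotness-aware controller could look like: expert hotness is tracked as an exponential moving average of activation counts, and a fixed budget of high-precision slots is given to the hottest experts. The decay rate, slot budget, and bit-width labels are illustrative assumptions.

```python
from collections import defaultdict

class HotnessController:
    """Toy hotness-aware precision controller (illustrative only)."""

    def __init__(self, num_experts, hi_slots, decay=0.9):
        self.num_experts = num_experts
        self.hi_slots = hi_slots                  # how many experts may hold high precision
        self.decay = decay
        self.hotness = defaultdict(float)         # EMA of per-expert activation counts
        self.precision = {e: "int4" for e in range(num_experts)}  # start fully quantized

    def observe(self, activated_experts):
        """Update hotness statistics after one MoE routing step."""
        counts = defaultdict(int)
        for e in activated_experts:
            counts[e] += 1
        for e in range(self.num_experts):
            self.hotness[e] = self.decay * self.hotness[e] + (1 - self.decay) * counts[e]

    def plan_transitions(self):
        """Return (promote, demote) sets so only the hottest experts hold FP16."""
        ranked = sorted(range(self.num_experts), key=lambda e: self.hotness[e], reverse=True)
        hot = set(ranked[: self.hi_slots])
        promote = {e for e in hot if self.precision[e] != "fp16"}
        demote = {e for e, p in self.precision.items() if p == "fp16" and e not in hot}
        for e in promote:
            self.precision[e] = "fp16"
        for e in demote:
            self.precision[e] = "int4"
        return promote, demote

# Example: after a batch routes tokens to experts 3, 3, 7, 1, the controller
# promotes the hottest experts up to its two-slot budget; demotions occur
# once previously hot experts cool down in later steps.
ctrl = HotnessController(num_experts=8, hi_slots=2)
ctrl.observe([3, 3, 7, 1])
print(ctrl.plan_transitions())
```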
📝 Abstract
Mixture-of-Experts (MoE) models scale LLM capacity efficiently, but deployment on consumer GPUs is limited by the large memory footprint of inactive experts. Static post-training quantization reduces storage costs but cannot adapt to shifting activation patterns, causing accuracy loss under aggressive compression. We therefore present DynaExq, a runtime system that treats expert precision as a first-class, dynamically managed resource. DynaExq combines (1) a hotness-aware precision controller that continuously aligns expert bit-widths with long-term activation statistics, (2) a fully asynchronous precision-switching pipeline that overlaps promotion and demotion with MoE computation, and (3) a fragmentation-free memory pooling mechanism that supports hybrid-precision experts with deterministic allocation. Together, these components enable stable, non-blocking precision transitions under strict HBM budgets. Across Qwen3-30B and Qwen3-80B MoE models and six representative benchmarks, DynaExq deploys large LLMs on single RTX 5090 and A6000 GPUs and improves accuracy by up to 4.03 points over static low-precision baselines. The results show that adaptive, workload-aware quantization is an effective strategy for memory-constrained MoE serving.
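The abstract's second component, an asynchronous precision-switching pipeline, overlaps promotion and demotion with MoE computation. The PyTorch sketch below illustrates that overlap pattern only: the dequantization placeholder, tensor names, and scale are assumptions rather than DynaExq's actual kernels, and a CUDA-capable device is required.

```python
import torch

def dequantize_int4_to_fp16(packed, scale):
    # Placeholder "promotion": treat `packed` as already-unpacked int8 values;
    # a real system would unpack int4 groups and apply per-group scales.
    return packed.half() * scale

def forward_active_experts(x, active_weights):
    # Stand-in for the MoE forward pass over currently active experts.
    return [x @ w for w in active_weights]

device = "cuda"
copy_stream = torch.cuda.Stream()               # side stream for precision switches
x = torch.randn(16, 1024, device=device, dtype=torch.float16)
active = [torch.randn(1024, 1024, device=device, dtype=torch.float16)]
cold_packed = torch.randint(-8, 8, (1024, 1024), device=device, dtype=torch.int8)

with torch.cuda.stream(copy_stream):            # promotion runs off the critical path
    promoted = dequantize_int4_to_fp16(cold_packed, scale=0.01)

outputs = forward_active_experts(x, active)     # main stream keeps computing meanwhile

torch.cuda.current_stream().wait_stream(copy_stream)  # sync before using promoted weights
outputs.append(x @ promoted)
```

The key design point mirrored here is that the main stream never blocks on the precision switch; it only waits at the moment the newly promoted weights are actually needed.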