DyMoE: Dynamic Expert Orchestration with Mixed-Precision Quantization for Efficient MoE Inference on Edge

📅 2026-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of deploying Mixture-of-Experts (MoE) models on edge devices, where high memory consumption and I/O overhead severely constrain inference efficiency, and where existing static approaches struggle to balance latency and accuracy. To overcome these limitations, the authors propose a dynamic mixed-precision quantization framework that, for the first time, integrates expert-importance awareness, depth-adaptive scheduling, and look-ahead prefetching. This enables runtime selection of per-expert precision while overlapping computation with I/O. Evaluated on commodity edge hardware, the method achieves substantial improvements: time-to-first-token (TTFT) latency is reduced by 3.44–22.7×, and time-per-output-token (TPOT) is accelerated by up to 14.58×, all while preserving model accuracy and outperforming current offloading strategies.

📝 Abstract
Despite the computational efficiency of MoE models, the excessive memory footprint and I/O overhead inherent in multi-expert architectures pose formidable challenges for real-time inference on resource-constrained edge platforms. While existing static methods struggle with a rigid latency-accuracy trade-off, we observe that expert importance is highly skewed and depth-dependent. Motivated by these insights, we propose DyMoE, a dynamic mixed-precision quantization framework designed for high-performance edge inference. Leveraging expert importance skewness and depth-dependent sensitivity, DyMoE introduces: (1) importance-aware prioritization to dynamically quantize experts at runtime; (2) depth-adaptive scheduling to preserve semantic integrity in critical layers; and (3) look-ahead prefetching to overlap I/O stalls with computation. Experimental results on commercial edge hardware show that DyMoE reduces Time-to-First-Token (TTFT) by 3.44x-22.7x and achieves up to a 14.58x speedup in Time-Per-Output-Token (TPOT) compared to state-of-the-art offloading baselines, enabling real-time, accuracy-preserving MoE inference on resource-constrained edge devices.
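The look-ahead prefetching idea in the abstract, overlapping the I/O of fetching the next layer's expert weights with the current layer's compute, can be illustrated with a minimal sketch. This is not the paper's implementation; the function names, the simulated latencies, and the single-slot queue are all hypothetical, chosen only to show how a background loader hides I/O stalls behind compute:

```python
import threading
import queue
import time

def load_experts(layer_id):
    """Stand-in for fetching quantized expert weights from storage (I/O-bound)."""
    time.sleep(0.01)  # simulated I/O latency
    return f"weights[layer {layer_id}]"

def compute_layer(layer_id, weights):
    """Stand-in for the MoE compute of one layer (compute-bound)."""
    time.sleep(0.01)  # simulated compute latency
    return f"activations[layer {layer_id}]"

def run_with_prefetch(num_layers):
    """Overlap expert I/O for layer i+1 with compute for layer i."""
    ready = queue.Queue(maxsize=1)  # one-layer lookahead buffer

    def prefetcher():
        # Runs ahead of the compute loop, loading the next layer's experts.
        for i in range(num_layers):
            ready.put(load_experts(i))

    loader = threading.Thread(target=prefetcher)
    loader.start()

    results = []
    for i in range(num_layers):
        weights = ready.get()  # usually already loaded, so compute rarely stalls
        results.append(compute_layer(i, weights))
    loader.join()
    return results
```

With balanced per-layer I/O and compute times, the prefetching loop takes roughly half the wall-clock time of a serial load-then-compute loop, since after the first layer each load proceeds concurrently with the previous layer's compute.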
Problem

Research questions and friction points this paper is trying to address.

MoE
edge inference
memory footprint
I/O overhead
resource-constrained
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Quantization
Mixture-of-Experts
Edge Inference
Mixed-Precision
Expert Orchestration
Yuegui Huang
Sun Yat-sen University, Guangzhou, China
Zhiyuan Fang
Sun Yat-sen University, Guangzhou, China
Weiqi Luo
School of Computer, Sun Yat-sen University, Guangzhou, P.R. China
Steganography and Steganalysis · Multimedia Forensics · AI Security
Ruoyu Wu
Tencent, Shenzhen, China
Wuhui Chen
Sun Yat-sen University, Guangzhou, China
Zibin Zheng
IEEE Fellow, Highly Cited Researcher, Sun Yat-sen University, China
Blockchain · Smart Contract · Services Computing · Software Reliability