D$^{2}$MoE: Dual Routing and Dynamic Scheduling for Efficient On-Device MoE-based LLM Serving

📅 2025-04-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
To enable efficient deployment of Mixture-of-Experts (MoE) large language models on edge devices, this work addresses a fundamental limitation of static compression strategies: their inability to jointly optimize service quality and computational overhead under multi-request workloads. The authors propose a framework combining dynamic expert-level bit-width allocation with I/O-compute co-scheduling. First, they introduce matryoshka weight quantization (MWQ), a nested quantization scheme that enables fine-grained, per-expert adjustable precision at runtime. Second, they design a hottest-expert-bit-first (HEBF) scheduling heuristic that enables memory-constrained parallel expert loading and overlaps computation with memory access. Evaluated on real edge hardware, the approach achieves up to 1.39× higher inference throughput and up to 53% lower peak memory usage while maintaining accuracy comparable to INT8 baselines. The work integrates dynamic quantization, expert routing, and system-level pipelined scheduling to improve MoE efficiency for on-device deployment under stringent resource constraints.
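The bit-nested idea behind MWQ can be illustrated with a minimal sketch: weights are quantized once at the highest precision, and any lower-precision version is recovered by truncating low-order bits, so all bit-widths share one stored tensor. This is an illustrative simplification assuming symmetric integer quantization; the function names, the single shared scale, and the plain bit-truncation scheme are assumptions, not the paper's exact formulation.

```python
import numpy as np

def mwq_quantize(w, max_bits=8):
    """Quantize weights at the highest precision so that lower-precision
    codes nest inside: the top k bits of each code act as a k-bit code.
    (Illustrative symmetric quantization with one per-tensor scale.)"""
    qmax = 2 ** (max_bits - 1) - 1
    scale = np.abs(w).max() / qmax
    codes = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int32)
    return codes, scale

def mwq_dequantize(codes, scale, bits, max_bits=8):
    """Recover a `bits`-precision view by dropping low-order bit planes.
    Floor truncation slightly biases negative values; acceptable for a sketch."""
    shift = max_bits - bits
    truncated = (codes >> shift) << shift  # arithmetic shift preserves sign
    return truncated * scale
```

Because every precision is a prefix of the same code, switching an expert from 8-bit to 4-bit requires no re-quantization, only reading fewer bit planes, which is what makes per-expert precision adjustable at serving time.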

📝 Abstract
The mixture of experts (MoE) model is a sparse variant of large language models (LLMs), designed to strike a better balance between intelligent capability and computational overhead. Despite its benefits, MoE is still too expensive to deploy on resource-constrained edge devices, especially with the demands of on-device inference services. Recent research efforts often apply model compression techniques, such as quantization, pruning, and merging, to restrict MoE complexity. Unfortunately, due to their predefined static model optimization strategies, they cannot always achieve the desired quality-overhead trade-off when handling multiple requests, ultimately degrading the on-device quality of service. These limitations motivate us to propose D$^2$MoE, an algorithm-system co-design framework that matches diverse task requirements by dynamically allocating the most proper bit-width to each expert. Specifically, inspired by the nested structure of matryoshka dolls, we propose matryoshka weight quantization (MWQ) to progressively compress expert weights in a bit-nested manner and reduce the required runtime memory. On top of it, we further optimize the I/O-computation pipeline and design a heuristic scheduling algorithm following our hottest-expert-bit-first (HEBF) principle, which maximizes the expert parallelism between the I/O and computation queues under constrained memory budgets, thus significantly reducing the idle temporal bubbles spent waiting for experts to load. Evaluations on real edge devices show that D$^2$MoE improves the overall inference throughput by up to 1.39$\times$ and reduces the peak memory footprint by up to 53% over the latest on-device inference frameworks, while still preserving serving accuracy comparable to its INT8 counterparts.
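The HEBF principle can be approximated by a greedy priority order: under a fixed memory budget, always load the next bit plane of the hottest (most frequently routed) expert before spending budget on colder ones. The sketch below is an assumed simplification; the `hotness`, `bit_planes`, `memory_budget`, and `plane_cost` parameters are hypothetical names, and the real D$^2$MoE scheduler additionally pipelines these loads against the computation queue.

```python
import heapq

def hebf_schedule(hotness, bit_planes, memory_budget, plane_cost):
    """Greedy hottest-expert-bit-first ordering (illustrative sketch):
    repeatedly load the next bit plane of the hottest expert that is
    not yet at full precision, until the memory budget is exhausted."""
    # Max-heap keyed on hotness; each entry tracks the next plane to load.
    heap = [(-h, expert, 0) for expert, h in hotness.items()]
    heapq.heapify(heap)
    order, used = [], 0
    while heap and used + plane_cost <= memory_budget:
        neg_h, expert, plane = heapq.heappop(heap)
        order.append((expert, plane))
        used += plane_cost
        if plane + 1 < bit_planes:  # expert can still gain precision
            heapq.heappush(heap, (neg_h, expert, plane + 1))
    return order
```

Under this ordering, hot experts reach high precision first, so the computation queue rarely stalls waiting for the weights it is most likely to need, which is the intuition behind the reduced "temporal bubbles" described in the abstract.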
Problem

Research questions and friction points this paper is trying to address.

Optimizing on-device MoE-based LLM serving efficiency
Dynamic bit-width allocation for diverse task requirements
Reducing memory footprint and idle time in inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic bit-width allocation for MoE experts
Matryoshka weight quantization for nested compression
Hottest-expert-bit-first scheduling for parallelism