Enabling MoE on the Edge via Importance-Driven Expert Scheduling

📅 2025-08-26
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the challenge of efficiently deploying Mixture-of-Experts (MoE) models on memory-constrained edge devices, this paper proposes a dynamic offloading and cache-reuse scheduling method grounded in expert importance assessment. The core method quantifies each expert's importance to drive scheduling decisions and uses functional-similarity matching to dynamically evict low-importance experts from the GPU cache while reusing highly similar cached experts, thereby jointly optimizing GPU memory footprint and PCIe data-transfer overhead. This approach achieves a cache hit rate exceeding 60%, reduces decoding latency by 48%, and preserves near-lossless model accuracy. The contribution lies in the first integration of importance-aware expert quantification into MoE scheduling, coupled with similarity-driven cache reuse, enabling scalable, high-efficiency system-level optimization of MoE inference under stringent edge resource constraints.


๐Ÿ“ Abstract
The Mixture of Experts (MoE) architecture has emerged as a key technique for scaling Large Language Models by activating only a subset of experts per query. Deploying MoE on consumer-grade edge hardware, however, is constrained by limited device memory, making dynamic expert offloading essential. Unlike prior work that treats offloading purely as a scheduling problem, we leverage expert importance to guide decisions, substituting low-importance activated experts with functionally similar ones already cached in GPU memory, thereby preserving accuracy. As a result, this design reduces memory usage and data transfer, while largely eliminating PCIe overhead. In addition, we introduce a scheduling policy that maximizes the reuse ratio of GPU-cached experts, further boosting efficiency. Extensive evaluations show that our approach delivers 48% lower decoding latency with over 60% expert cache hit rate, while maintaining nearly lossless accuracy.
Problem

Research questions and friction points this paper is trying to address.

Optimizing expert scheduling for MoE on memory-constrained edge devices
Reducing PCIe overhead and data transfer via importance-driven expert substitution
Maintaining model accuracy while improving decoding latency and cache efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Importance-driven expert scheduling for MoE
Substituting low-importance experts with cached ones
Maximizing GPU-cached expert reuse ratio
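The scheduling idea described above can be sketched as a small cache-management loop: cache hits are served directly, low-importance activated experts are substituted with a functionally similar expert already resident in GPU memory (avoiding a PCIe transfer), and everything else is loaded after evicting the least-important cached expert. All names, thresholds, and data layouts here are illustrative assumptions, not the paper's actual implementation:

```python
def schedule_experts(activated, importance, similarity, cache, cache_size,
                     importance_threshold=0.1, similarity_threshold=0.9):
    """Sketch of importance-driven expert scheduling (hypothetical API).

    activated:  expert IDs selected by the router for this token
    importance: dict mapping expert ID -> importance score
    similarity: dict of dicts; similarity[e][c] is the functional
                similarity between expert e and cached expert c
    cache:      set of expert IDs currently resident in GPU memory
    Returns a plan mapping each activated expert to an action tuple.
    """
    plan = {}
    for e in activated:
        if e in cache:
            plan[e] = ("hit", e)  # already resident, no transfer needed
            continue
        if importance[e] < importance_threshold and cache:
            # Low-importance expert: reuse the most similar cached expert
            # instead of fetching this one over PCIe.
            best = max(cache, key=lambda c: similarity[e][c])
            if similarity[e][best] >= similarity_threshold:
                plan[e] = ("substitute", best)
                continue
        # High-importance (or no good substitute): load over PCIe,
        # evicting the least-important cached expert(s) to make room.
        while len(cache) >= cache_size:
            victim = min(cache, key=lambda c: importance[c])
            cache.remove(victim)
        cache.add(e)
        plan[e] = ("load", e)
    return plan
```

Under this sketch, the substitution branch is what removes PCIe traffic for low-importance experts, while the importance-ordered eviction keeps high-importance experts resident and pushes the cache hit rate up, which is the mechanism the paper credits for its latency reduction.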
Authors
Guoying Zhu, Nanjing University
Meng Li, Nanjing University
Haipeng Dai, Nanjing University (wireless sensor networks, wireless power transfer)
Xuechen Liu, National Institute of Informatics (speaker verification, speech recognition, spoofing detection)
Weijun Wang, Tsinghua University (LLM serving systems, edge AI, video analytics systems)
Keran Li, Nanjing University
Jun Xiao, Honor Device Co., Ltd.
Ligeng Chen, Honor Device Co., Ltd.
Wei Wang, Nanjing University