Enabling MoE on the Edge via Importance-Driven Expert Scheduling

📅 2025-08-26

🤖 AI Summary
To deploy Mixture-of-Experts (MoE) models efficiently on memory-constrained edge devices, this paper proposes a dynamic offloading and cache-reuse scheduling method based on expert importance. The method quantifies the importance of each activated expert and uses functional similarity matching to replace low-importance experts that are not in GPU memory with similar experts that are already cached, which reduces both the GPU memory footprint and PCIe data transfer. The approach achieves an expert cache hit rate above 60%, reduces decoding latency by 48%, and keeps model accuracy nearly lossless. The stated contribution is the first integration of expert importance into MoE scheduling, combined with similarity-driven cache reuse, for MoE inference under tight edge resource constraints.

📝 Abstract
The Mixture of Experts (MoE) architecture has emerged as a key technique for scaling Large Language Models by activating only a subset of experts per query. Deploying MoE on consumer-grade edge hardware, however, is constrained by limited device memory, making dynamic expert offloading essential. Unlike prior work that treats offloading purely as a scheduling problem, we leverage expert importance to guide decisions, substituting low-importance activated experts with functionally similar ones already cached in GPU memory, thereby preserving accuracy. As a result, this design reduces memory usage and data transfer, while largely eliminating PCIe overhead. In addition, we introduce a scheduling policy that maximizes the reuse ratio of GPU-cached experts, further boosting efficiency. Extensive evaluations show that our approach delivers 48% lower decoding latency with over 60% expert cache hit rate, while maintaining nearly lossless accuracy.
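The substitution idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes the router gate weight serves as the importance score, that a precomputed expert-to-expert similarity table is available, and that the two thresholds are hypothetical tuning knobs. All names (`Routing`, `substitute_experts`, `importance_threshold`, `similarity_threshold`) are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Routing:
    expert_id: int
    gate_weight: float  # assumption: gate weight stands in for expert importance


def substitute_experts(routed, cached, similarity,
                       importance_threshold=0.2, similarity_threshold=0.8):
    """Return (expert ids to execute, expert ids that must be loaded over PCIe).

    routed:     list of Routing entries chosen by the router for one token
    cached:     set of expert ids currently resident in GPU memory
    similarity: similarity[a][b] is a hypothetical precomputed score in [0, 1]
    """
    resolved, to_fetch = [], []
    for r in routed:
        if r.expert_id in cached:
            # Cache hit: run the expert as routed, no transfer needed.
            resolved.append(r.expert_id)
            continue
        if r.gate_weight < importance_threshold and cached:
            # Low-importance miss: look for the most similar cached expert.
            best = max(sorted(cached), key=lambda c: similarity[r.expert_id][c])
            if similarity[r.expert_id][best] >= similarity_threshold:
                resolved.append(best)  # substitute, avoiding a PCIe transfer
                continue
        # Important expert, or no close enough substitute: fetch it.
        resolved.append(r.expert_id)
        to_fetch.append(r.expert_id)
    return resolved, to_fetch


routed = [Routing(expert_id=2, gate_weight=0.7), Routing(expert_id=3, gate_weight=0.1)]
cached = {0, 1}
similarity = {2: {0: 0.9, 1: 0.3}, 3: {0: 0.2, 1: 0.85}}
resolved, to_fetch = substitute_experts(routed, cached, similarity)
```

In this toy run, expert 2 is important and is fetched, while the low-importance expert 3 is replaced by cached expert 1, so only one expert crosses PCIe instead of two.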
Problem

Research questions and friction points this paper is trying to address.

Optimizing expert scheduling for MoE on memory-constrained edge devices
Reducing PCIe overhead and data transfer via importance-driven expert substitution
Maintaining model accuracy while improving decoding latency and cache efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Importance-driven expert scheduling for MoE
Substituting low-importance experts with cached ones
Maximizing GPU-cached expert reuse ratio
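One way to picture importance-aware caching and the reported hit-rate metric is the sketch below. The summary does not spell out the paper's scheduling policy, so this uses a simple stand-in heuristic: evict the cached expert with the lowest accumulated importance. The class `ExpertCache` and its fields are hypothetical.

```python
class ExpertCache:
    """Toy GPU expert cache with importance-based eviction (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.scores = {}   # expert_id -> accumulated importance of cached experts
        self.hits = 0
        self.lookups = 0

    def access(self, expert_id, importance):
        """Record one expert activation; return True on a cache hit."""
        self.lookups += 1
        if expert_id in self.scores:
            self.hits += 1
            self.scores[expert_id] += importance
            return True
        if len(self.scores) >= self.capacity:
            # Assumed policy: evict the least important cached expert.
            victim = min(self.scores, key=self.scores.get)
            del self.scores[victim]
        self.scores[expert_id] = importance  # stands in for a PCIe load
        return False

    @property
    def hit_rate(self):
        return self.hits / self.lookups if self.lookups else 0.0


cache = ExpertCache(capacity=2)
trace = [(0, 0.9), (1, 0.1), (0, 0.8), (2, 0.5), (0, 0.7)]
results = [cache.access(eid, imp) for eid, imp in trace]
```

Here the low-importance expert 1 is evicted when expert 2 arrives, so the frequently used expert 0 stays resident and is reused on its next activation.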