OD-MoE: On-Demand Expert Loading for Cacheless Edge-Distributed MoE Inference

📅 2025-12-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of deploying Mixture-of-Experts (MoE) models on memory-constrained edge devices (<1 GB GPU memory), this paper proposes OD-MoE—a distributed MoE inference framework that eliminates the need for expert caching. Its core contributions are: (i) cross-node parallel loading and computation, decoupling parameter storage from computation; (ii) a multi-layer forward-activation–based prefetching mechanism that achieves 99.94% expert activation prediction accuracy; and (iii) dynamic real-time parameter scheduling, removing reliance on local expert caches. Experiments demonstrate that OD-MoE achieves 75% of the decoding throughput of full-caching baselines while consuming only one-third of their GPU memory. This significantly lowers the deployment barrier for MoE models on edge devices, enabling high-accuracy, low-latency inference under stringent memory constraints.
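The scheduling idea in the summary — hide expert-loading latency behind computation by starting loads several layers ahead, then evicting each expert right after use — can be sketched as below. This is a minimal illustration, not the paper's implementation; `predict_ahead`, `load_expert`, and `compute_layer` are hypothetical stand-ins for the predictor, cross-node parameter transfer, and expert computation.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sketch of on-demand expert loading: while layer l computes,
# the expert predicted for layer l + lookahead is already being fetched,
# so loading overlaps with computation instead of blocking it.
def run_layers(num_layers, predict_ahead, load_expert, compute_layer, lookahead=2):
    pool = ThreadPoolExecutor(max_workers=4)
    pending = {}  # layer index -> future holding that layer's loaded expert
    # Warm up: start loading experts for the first `lookahead` layers.
    for l in range(min(lookahead, num_layers)):
        pending[l] = pool.submit(load_expert, predict_ahead(l))
    outputs = []
    for l in range(num_layers):
        # Kick off the load for a future layer while this one computes.
        nxt = l + lookahead
        if nxt < num_layers:
            pending[nxt] = pool.submit(load_expert, predict_ahead(nxt))
        weights = pending.pop(l).result()  # just-in-time: blocks only if the load lags
        outputs.append(compute_layer(l, weights))
        # Prompt eviction: `weights` is dropped here, freeing memory for later experts.
    pool.shutdown()
    return outputs
```

With an accurate predictor, `.result()` almost never blocks, which is why prediction accuracy (99.94% in the paper) is the linchpin of a cacheless design.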

📝 Abstract
Mixture-of-Experts (MoE), while offering significant advantages as a Large Language Model (LLM) architecture, faces substantial challenges when deployed on low-cost edge devices with tight memory constraints. Expert offloading mitigates this issue by storing expert parameters in CPU memory and caching a subset of popular experts in GPU memory. Although this approach improves GPU memory utilization by caching only the likely-used experts, the GPU memory reserved for expert caching is underutilized compared with dense LLMs. This paper presents OD-MoE, a distributed MoE inference framework that obviates the need for expert caches via fully on-demand expert loading. OD-MoE is built upon two key mechanisms: 1) parallelizing expert loading and expert computation across distributed edge nodes, and 2) an ultra-accurate emulative predictor that forecasts expert activations multiple layers ahead while expert computation is ongoing. With these innovations, OD-MoE dynamically loads each target expert to one of the distributed nodes just-in-time before its activation and promptly evicts it afterward, freeing GPU memory for subsequent experts. We comprehensively benchmark OD-MoE against state-of-the-art MoE offloading systems on a ten-node testbed. Experimental results show that: 1) OD-MoE achieves 99.94% expert activation prediction accuracy, substantially surpassing all existing methods; and 2) OD-MoE delivers approximately 75% of the decoding speed of a fully GPU-cached MoE deployment while using only 1/3 of the GPU memory. More importantly, by eliminating the need for expert caches, OD-MoE enables MoE inference on edge nodes with less than 1 GB of GPU memory, paving the way for practical MoE deployment on low-cost IoT devices at the edge in the LLM era.
Problem

Research questions and friction points this paper is trying to address.

Optimizing memory usage for Mixture-of-Experts models on edge devices
Eliminating expert caches to enable inference with limited GPU memory
Distributing expert loading and computation across nodes for efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

On-demand expert loading without caching for edge-distributed MoE inference
Parallel expert loading and computation across distributed edge nodes
Ultra-accurate emulative predictor forecasting expert activations multiple layers ahead
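For context on what the emulative predictor must forecast: a standard MoE router picks the top-k experts per token from gate logits, and the predictor has to guess these selections several layers before the true hidden state exists, so loading can start early. The sketch below shows generic top-k routing only; it is not the paper's predictor, and `route` is an illustrative name.

```python
import math

# Generic MoE top-k routing (illustrative, not OD-MoE's predictor):
# pick the k experts with the highest gate logits, then weight their
# outputs by a softmax over the chosen logits.
def route(gate_logits, k=2):
    """Return (chosen expert indices, normalized weights over those experts)."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    exps = [math.exp(gate_logits[i]) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]
```

A lookahead predictor that reproduces the indices returned by `route` for future layers lets the system prefetch exactly the experts that will fire, which is what makes a cache-free design viable.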
Authors
Liujianfu Wang (The Chinese University of Hong Kong)
Yuyang Du (Department of Information Engineering, CUHK; research areas: Generative AIs, Wireless Communication, Networking)
Yuchen Pan (The Chinese University of Hong Kong)
S. Liew (The Chinese University of Hong Kong)
Jiacheng Liu (The Chinese University of Hong Kong)
Kexin Chen (CUHK; research areas: LLM/VLMs, AI Agent, Multi-modality Learning, AI for Science)