OD-MoE: On-Demand Expert Loading for Cacheless Edge-Distributed MoE Inference

📅 2025-12-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of deploying Mixture-of-Experts (MoE) models on memory-constrained edge devices (<1 GB GPU memory), this paper proposes OD-MoE—a distributed MoE inference framework that eliminates the need for expert caching. Its core contributions are: (i) cross-node parallel loading and computation, decoupling parameter storage from computation; (ii) a multi-layer forward-activation–based prefetching mechanism that achieves 99.94% expert activation prediction accuracy; and (iii) dynamic real-time parameter scheduling, removing reliance on local expert caches. Experiments demonstrate that OD-MoE achieves 75% of the decoding throughput of full-caching baselines while consuming only one-third of their GPU memory. This significantly lowers the deployment barrier for MoE models on edge devices, enabling high-accuracy, low-latency inference under stringent memory constraints.
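The scheduling idea in the summary — hide expert-loading latency behind computation by starting loads several layers ahead, then evicting each expert right after use — can be sketched as below. This is a minimal illustration, not the paper's implementation; `predict_ahead`, `load_expert`, and `compute_layer` are hypothetical stand-ins for the predictor, cross-node parameter transfer, and expert computation.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sketch of on-demand expert loading: while layer l computes,
# the expert predicted for layer l + lookahead is already being fetched,
# so loading overlaps with computation instead of blocking it.
def run_layers(num_layers, predict_ahead, load_expert, compute_layer, lookahead=2):
    pool = ThreadPoolExecutor(max_workers=4)
    pending = {}  # layer index -> future holding that layer's loaded expert
    # Warm up: start loading experts for the first `lookahead` layers.
    for l in range(min(lookahead, num_layers)):
        pending[l] = pool.submit(load_expert, predict_ahead(l))
    outputs = []
    for l in range(num_layers):
        # Kick off the load for a future layer while this one computes.
        nxt = l + lookahead
        if nxt < num_layers:
            pending[nxt] = pool.submit(load_expert, predict_ahead(nxt))
        weights = pending.pop(l).result()  # just-in-time: blocks only if the load lags
        outputs.append(compute_layer(l, weights))
        # Prompt eviction: `weights` is dropped here, freeing memory for later experts.
    pool.shutdown()
    return outputs
```

With an accurate predictor, `.result()` almost never blocks, which is why prediction accuracy (99.94% in the paper) is the linchpin of a cacheless design.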

📝 Abstract
Mixture-of-Experts (MoE), while offering significant advantages as a Large Language Model (LLM) architecture, faces substantial challenges when deployed on low-cost edge devices with tight memory constraints. Expert offloading mitigates this issue by storing expert parameters in CPU memory and caching a subset of popular experts in GPU memory. Although this approach improves GPU memory utilization by caching only the likely-used experts, the GPU memory reserved for expert caching is underutilized compared with dense LLMs. This paper presents OD-MoE, a distributed MoE inference framework that obviates the need for expert caches via fully on-demand expert loading. OD-MoE is built upon two key mechanisms: 1) parallelizing expert loading and expert computation across distributed edge nodes, and 2) an ultra-accurate emulative predictor that forecasts expert activations multiple layers ahead while expert computation is ongoing. With these innovations, OD-MoE dynamically loads each target expert to one of the distributed nodes just-in-time before its activation and promptly evicts it afterward, freeing GPU memory for subsequent experts. We comprehensively benchmark OD-MoE against state-of-the-art MoE offloading systems on a ten-node testbed. Experimental results show that: 1) OD-MoE achieves 99.94% expert activation prediction accuracy, substantially surpassing all existing methods; and 2) OD-MoE delivers approximately 75% of the decoding speed of a fully GPU-cached MoE deployment while using only 1/3 of the GPU memory. More importantly, by eliminating the need for expert caches, OD-MoE enables MoE inference on edge nodes with less than 1 GB of GPU memory, paving the way for practical MoE deployment on low-cost IoT devices at the edge in the LLM era.
Problem

Research questions and friction points this paper is trying to address.

Optimizing memory usage for Mixture-of-Experts models on edge devices
Eliminating expert caches to enable inference with limited GPU memory
Distributing expert loading and computation across nodes for efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

On-demand expert loading without caching for edge-distributed MoE inference
Parallel expert loading and computation across distributed edge nodes
Ultra-accurate emulative predictor forecasting expert activations multiple layers ahead
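For context on what the emulative predictor must forecast: a standard MoE router picks the top-k experts per token from gate logits, and the predictor has to guess these selections several layers before the true hidden state exists, so loading can start early. The sketch below shows generic top-k routing only; it is not the paper's predictor, and `route` is an illustrative name.

```python
import math

# Generic MoE top-k routing (illustrative, not OD-MoE's predictor):
# pick the k experts with the highest gate logits, then weight their
# outputs by a softmax over the chosen logits.
def route(gate_logits, k=2):
    """Return (chosen expert indices, normalized weights over those experts)."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    exps = [math.exp(gate_logits[i]) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]
```

A lookahead predictor that reproduces the indices returned by `route` for future layers lets the system prefetch exactly the experts that will fire, which is what makes a cache-free design viable.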
Authors
Liujianfu Wang (The Chinese University of Hong Kong)
Yuyang Du (Department of Information Engineering, CUHK; research areas: Generative AIs, Wireless Communication, Networking)
Yuchen Pan (The Chinese University of Hong Kong)
S. Liew (The Chinese University of Hong Kong)
Jiacheng Liu (The Chinese University of Hong Kong)
Kexin Chen (CUHK; research areas: LLM/VLMs, AI Agent, Multi-modality Learning, AI for Science)