🤖 AI Summary
To address the low inference efficiency of Mixture-of-Experts (MoE) models on personal devices with limited GPU memory, where single-user workloads run at a batch size of one, this paper proposes MoE-Infinity, a sparsity-aware dynamic expert caching mechanism. Methodologically, it introduces a lightweight runtime tracing module that monitors expert activation sparsity patterns during decoding and uses selected traces to jointly guide cache replacement and prefetching. Its key contribution lies in breaking the memory and latency bottlenecks of MoE inference on personal devices: across MoE models including DeepSeek-MoE and Mixtral, it improves per-token latency by 3.1-16.7x over state-of-the-art systems including vLLM, Ollama, DeepSpeed, and BrainStorm, enabling efficient local MoE inference.
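A minimal sketch of what such runtime activation tracing could look like, assuming the router calls a hook after its top-k expert selection at each MoE layer; `ExpertActivationTracker`, `record`, and `hot_experts` are illustrative names, not MoE-Infinity's actual API:

```python
from collections import defaultdict

class ExpertActivationTracker:
    """Per-layer activation counters plus a chronological trace of
    which experts each decoding step routed to.

    Hypothetical sketch: the router is assumed to call `record` after
    top-k expert selection; this is not MoE-Infinity's real interface.
    """

    def __init__(self, num_layers: int):
        # counts[layer][expert] = how often this expert has fired so far
        self.counts = [defaultdict(int) for _ in range(num_layers)]
        self.trace = []  # chronological (layer, expert_ids) activations

    def record(self, layer: int, expert_ids: list[int]) -> None:
        """Record one routing decision (one token, one MoE layer)."""
        for e in expert_ids:
            self.counts[layer][e] += 1
        self.trace.append((layer, tuple(expert_ids)))

    def hot_experts(self, layer: int, k: int) -> list[int]:
        """Return the k most frequently reused experts at `layer`."""
        ranked = sorted(self.counts[layer].items(), key=lambda kv: -kv[1])
        return [e for e, _ in ranked[:k]]
```

Because batch-size-one decoding reuses a small, stable set of experts, even counters this simple can separate hot experts from cold ones and feed a cache policy.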
📝 Abstract
This paper presents MoE-Infinity, an efficient MoE inference system designed for personal machines with limited GPU memory capacity. The key idea behind MoE-Infinity is that personal machines are typically single-user environments, so MoE-based LLMs operate with a batch size of one. In this setting, MoE models exhibit a high degree of activation sparsity: a small number of experts are frequently reused when generating tokens during the decode phase. Leveraging this observation, we design a sparsity-aware expert cache, which traces the sparse activation of experts during inference and carefully selects the traces that represent the sparsity pattern. By analyzing these selected traces, MoE-Infinity guides the replacement and prefetching of the expert cache, providing 3.1-16.7x per-token latency improvements over numerous state-of-the-art systems, including vLLM, Ollama, DeepSpeed, and BrainStorm, across various MoE models (DeepSeek and Mixtral) handling different LLM tasks. MoE-Infinity's source code is publicly available at https://github.com/EfficientMoE/MoE-Infinity.
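To make the trace-guided replacement and prefetching concrete, here is a minimal, self-contained Python sketch of a sparsity-aware expert cache driven by observed activation counts; `load_expert`, `SparsityAwareExpertCache`, and the eviction/prefetch heuristics are hypothetical simplifications for illustration, not the system's actual implementation:

```python
from collections import defaultdict

def load_expert(layer: int, expert: int):
    """Placeholder: a real system would copy this expert's weights
    from host DRAM or SSD into GPU memory (hypothetical helper)."""
    return f"weights[layer={layer}, expert={expert}]"

class SparsityAwareExpertCache:
    """Keeps at most `capacity` experts resident, evicting the expert
    with the lowest observed reuse count instead of plain LRU."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.cache = {}                # (layer, expert) -> weights
        self.reuse = defaultdict(int)  # (layer, expert) -> activations seen

    def get(self, layer: int, expert: int):
        key = (layer, expert)
        self.reuse[key] += 1           # trace the activation
        if key not in self.cache:      # miss: fetch on demand
            self._make_room()
            self.cache[key] = load_expert(layer, expert)
        return self.cache[key]

    def prefetch(self, layer: int, k: int = 2):
        """Warm the k historically hottest experts of `layer` ahead of
        the router, overlapping the copy with upstream computation."""
        hot = sorted(((e, n) for (l, e), n in self.reuse.items() if l == layer),
                     key=lambda x: -x[1])[:k]
        for e, _ in hot:
            if (layer, e) not in self.cache:
                self._make_room()
                self.cache[(layer, e)] = load_expert(layer, e)

    def _make_room(self):
        while len(self.cache) >= self.capacity:
            victim = min(self.cache, key=lambda kk: self.reuse[kk])
            del self.cache[victim]
```

In use, an inference loop would call `cache.get(layer, expert)` for each routed expert and `cache.prefetch(layer + 1)` while the current layer computes, so host-to-GPU transfers overlap with GPU compute rather than stalling the decode path.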