MoE-Infinity: Efficient MoE Inference on Personal Machines with Sparsity-Aware Expert Cache

📅 2024-01-25
📈 Citations: 6
Influential: 1
🤖 AI Summary
To address the low inference efficiency of Mixture-of-Experts (MoE) models on personal machines with limited GPU memory, where single-user operation typically means a batch size of one, this paper proposes a sparsity-aware expert cache. A lightweight runtime component traces expert activation patterns during decoding and selects the traces that best represent the model's sparsity pattern; these traces then guide both cache replacement and prefetching of experts into GPU memory. The key contribution is breaking the memory and latency bottlenecks of MoE inference on personal devices: across MoE models including DeepSeek and Mixtral, MoE-Infinity reduces per-token latency by 3.1-16.7x over state-of-the-art baselines, including vLLM, Ollama, DeepSpeed, and BrainStorm.

📝 Abstract
This paper presents MoE-Infinity, an efficient MoE inference system designed for personal machines with limited GPU memory capacity. The key idea for MoE-Infinity is that on personal machines, which are often single-user environments, MoE-based LLMs typically operate with a batch size of one. In this setting, MoE models exhibit a high degree of activation sparsity, meaning a small number of experts are frequently reused in generating tokens during the decode phase. Leveraging this idea, we design a sparsity-aware expert cache, which can trace the sparse activation of experts during inference and carefully select the trace that represents the sparsity pattern. By analyzing these selected traces, MoE-Infinity guides the replacement and prefetching of the expert cache, providing 3.1-16.7x per-token latency improvements over numerous state-of-the-art systems, including vLLM, Ollama, DeepSpeed and BrainStorm across various MoE models (DeepSeek and Mixtral) when handling different LLM tasks. MoE-Infinity's source code is publicly available at https://github.com/EfficientMoE/MoE-Infinity.
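The core mechanism described in the abstract, a cache that traces expert activations and evicts the least reused expert when GPU memory is full, can be illustrated with a minimal sketch. This is not MoE-Infinity's actual implementation; the class name, the eviction policy (least activation count), and the integer expert identifiers are all assumptions for illustration.

```python
from collections import defaultdict


class SparsityAwareExpertCache:
    """Illustrative sketch of a sparsity-aware expert cache: keep the most
    frequently activated experts resident in (simulated) GPU memory and
    evict the least reused expert on overflow. Names and policy are
    assumptions, not MoE-Infinity's actual code."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cached = set()                   # expert ids resident "on GPU"
        self.activations = defaultdict(int)   # activation trace: counts per expert

    def access(self, expert_id):
        """Record an activation; return True on a cache hit, False on a miss
        (which would trigger a host-to-GPU transfer in a real system)."""
        self.activations[expert_id] += 1
        if expert_id in self.cached:
            return True
        if len(self.cached) >= self.capacity:
            # Evict the resident expert with the fewest recorded activations.
            victim = min(self.cached, key=lambda e: self.activations[e])
            self.cached.remove(victim)
        self.cached.add(expert_id)
        return False
```

With a batch size of one and skewed activations, the hot experts stay resident after a short warm-up, so most decode steps hit the cache.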
Problem

Research questions and friction points this paper is trying to address.

Efficient MoE inference on personal machines with limited GPU memory.
Leveraging activation sparsity to optimize expert cache usage.
Improving per-token latency for MoE-based LLMs on personal devices.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparsity-aware expert cache for efficient inference.
Dynamic expert replacement and prefetching guided by activation traces.
3.1-16.7x per-token latency improvements on personal machines.
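The prefetching side of the contribution can also be sketched: if recorded traces show which experts tend to follow one another across layers, the runtime can start fetching likely next-layer experts before the router selects them. The trace format (one expert id per layer) and function names below are hypothetical, chosen only to illustrate the idea.

```python
from collections import defaultdict


def build_cooccurrence(traces):
    """Build a (layer, expert) -> {next_expert: count} table from recorded
    traces. Hypothetical trace format: each trace is a list giving the
    activated expert id at every layer, in order."""
    co = defaultdict(lambda: defaultdict(int))
    for trace in traces:
        for layer in range(len(trace) - 1):
            co[(layer, trace[layer])][trace[layer + 1]] += 1
    return co


def prefetch_candidates(co, layer, expert, k=2):
    """Return up to k experts for the next layer, ranked by how often they
    followed `expert` at `layer` in the traces; these would be prefetched
    into GPU memory ahead of the router's decision."""
    following = co.get((layer, expert), {})
    return sorted(following, key=following.get, reverse=True)[:k]
```

In a real system the prefetch would run asynchronously, overlapping expert transfers with computation of earlier layers; the sketch only shows how a trace turns into a prefetch decision.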