MoE-Gen: High-Throughput MoE Inference on a Single GPU with Module-Based Batching

📅 2025-03-12
🤖 AI Summary
Existing MoE inference systems rely on model-based or continuous batching strategies, which ignore the different computational characteristics of attention and expert modules and therefore yield low single-GPU throughput. This work proposes module-based batching: tokens are accumulated in host memory, and each module (e.g., attention and experts) is assigned its own optimized batch size so that GPU computation and communication overlap. This replaces coarse-grained, model-level batching with fine-grained, per-module batch scheduling for MoE inference. Evaluated on popular MoE models including DeepSeek and Mixtral, the method achieves 8–31× higher throughput than state-of-the-art model-based batching systems (FlexGen, MoE-Lightning, and DeepSpeed) and even larger improvements over continuous batching systems such as vLLM and Ollama.

📝 Abstract
This paper presents MoE-Gen, a high-throughput MoE inference system optimized for single-GPU execution. Existing inference systems rely on model-based or continuous batching strategies, originally designed for interactive inference, which result in excessively small batches for MoE's key modules (attention and expert modules), leading to poor throughput. To address this, we introduce module-based batching, which accumulates tokens in host memory and dynamically launches large batches on GPUs to maximize utilization. Additionally, we optimize the choice of batch sizes for each module in an MoE to fully overlap GPU computation and communication, maximizing throughput. Evaluation demonstrates that MoE-Gen achieves 8-31x higher throughput compared to state-of-the-art systems employing model-based batching (FlexGen, MoE-Lightning, DeepSpeed), and offers even greater throughput improvements over continuous batching systems (e.g., vLLM and Ollama) on popular MoE models (DeepSeek and Mixtral) across offline inference tasks. MoE-Gen's source code is publicly available at https://github.com/EfficientMoE/MoE-Gen
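The idea of accumulating tokens in host memory and launching each module on its own batch size can be illustrated with a toy sketch. This is not MoE-Gen's implementation: the `ModuleBatcher` class, the batch sizes (4 and 16), and the lambda functions standing in for GPU kernels are all hypothetical, and a Python `deque` stands in for the host-memory buffer.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List


class ModuleBatcher:
    """Hypothetical helper: buffers inputs host-side and runs one module in fixed-size batches."""

    def __init__(self, name: str, batch_size: int, run: Callable[[List[int]], List[int]]):
        self.name = name
        self.batch_size = batch_size          # per-module batch size (illustrative value)
        self.run = run                        # stand-in for a GPU kernel launch
        self.pending: Deque[int] = deque()    # stand-in for a host-memory token buffer
        self.launches = 0                     # number of batches launched so far

    def submit(self, tokens: Iterable[int]) -> None:
        """Queue tokens without launching anything."""
        self.pending.extend(tokens)

    def drain(self, flush: bool = False) -> List[int]:
        """Launch every full batch; with flush=True also launch a final partial batch."""
        out: List[int] = []
        while len(self.pending) >= self.batch_size or (flush and self.pending):
            n = min(self.batch_size, len(self.pending))
            batch = [self.pending.popleft() for _ in range(n)]
            out.extend(self.run(batch))
            self.launches += 1
        return out


# Two modules with different batch sizes; the lambdas are placeholders for real compute.
attention = ModuleBatcher("attention", 4, lambda b: [t + 1 for t in b])
expert = ModuleBatcher("expert", 16, lambda b: [t * 2 for t in b])

results: List[int] = []
for start in range(0, 32, 4):                 # tokens arrive in small groups of 4
    attention.submit(range(start, start + 4))
    expert.submit(attention.drain())          # attention output waits in the expert buffer
    results.extend(expert.drain())            # expert runs only once 16 tokens are queued
results.extend(expert.drain(flush=True))      # launch any leftover partial batch

print(attention.launches, expert.launches)
```

In this sketch the attention stand-in launches once per group of 4 tokens, while the expert stand-in waits until its larger batch is full, which is the decoupling that module-based batching relies on.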
Problem

Research questions and friction points this paper is trying to address.

Optimizes MoE inference for single-GPU execution
Introduces module-based batching to maximize GPU utilization
Achieves higher throughput compared to existing systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Module-based batching for high-throughput MoE inference
Dynamic large batch launches to maximize GPU utilization
Optimized batch sizes to overlap GPU computation and communication
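One way to reason about the last point is a back-of-the-envelope cost model: a module's computation can hide a host-to-GPU transfer only if the batch is large enough for compute time to match transfer time. The sketch below assumes compute time grows linearly with batch size and that the transfer size is fixed. The function name, parameters, and numbers are illustrative assumptions, not the paper's cost model or measured values.

```python
def min_batch_for_overlap(
    transfer_bytes: int,
    link_bytes_per_s: int,
    flops_per_token: int,
    gpu_flops_per_s: int,
) -> int:
    """Smallest batch size b with b * compute_time_per_token >= transfer_time.

    Assumes a linear compute model; integer arithmetic avoids float rounding.
    """
    # transfer_time = transfer_bytes / link_bytes_per_s
    # compute_time_per_token = flops_per_token / gpu_flops_per_s
    numerator = transfer_bytes * gpu_flops_per_s
    denominator = link_bytes_per_s * flops_per_token
    return -(-numerator // denominator)  # ceiling division


# Illustrative numbers only: 1 GB transferred over a 25 GB/s link,
# 1 GFLOP of work per token on a 100 TFLOP/s GPU.
batch = min_batch_for_overlap(
    transfer_bytes=10**9,
    link_bytes_per_s=25 * 10**9,
    flops_per_token=10**9,
    gpu_flops_per_s=10**14,
)
print(batch)  # 4000
```

Under these assumed numbers the transfer takes 40 ms and each token takes 10 microseconds of compute, so roughly 4000 tokens are needed before computation fully covers the transfer. Modules with different transfer sizes or per-token costs would land on different batch sizes.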
Authors

Tairan Xu
The University of Edinburgh

Leyang Xue
University of Edinburgh
Machine Learning System, Mixture-of-Expert, Large Language Model

Zhan Lu
The University of Edinburgh

Adrian Jackson
EPCC, The University of Edinburgh

Luo Mai
Associate Professor at University of Edinburgh
Computer Systems, Machine Learning, Data Management