MoE-Lens: Towards the Hardware Limit of High-Throughput MoE LLM Serving Under Resource Constraints

📅 2025-04-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the low inference throughput of Mixture-of-Experts (MoE) large language models under GPU memory constraints, this paper proposes MoE-Lens, a full-stack CPU-GPU co-execution optimization framework. First, the authors develop a fine-grained performance model that integrates hardware characteristics (CPU memory bandwidth, GPU compute capacity, and expert routing load) with system execution semantics to precisely identify bottlenecks and approach the theoretical throughput limit. Second, they design a dynamic resource-scheduling and computation-offloading strategy supporting weight sharding, elastic expert loading, and cross-device pipelined execution. Evaluated across multiple MoE models and datasets, the approach achieves an average 4.6× throughput improvement (up to 25.5×), with a performance-prediction accuracy of 94% (under 6% average error). The work enables hardware-aware, full-stack modeling and real-time collaborative scheduling for MoE inference, significantly alleviating the GPU memory-wall bottleneck in high-throughput deployment.

📝 Abstract
Mixture of Experts (MoE) LLMs, characterized by their sparse activation patterns, offer a promising approach to scaling language models while avoiding proportionally increasing the inference cost. However, their large parameter sizes present deployment challenges in resource-constrained environments with limited GPU memory capacity, as GPU memory is often insufficient to accommodate the full set of model weights. Consequently, typical deployments rely on CPU-GPU hybrid execution: the GPU handles compute-intensive GEMM operations, while the CPU processes the relatively lightweight attention mechanism. This setup introduces a key challenge: how to effectively optimize resource utilization across CPU and GPU? Prior work has designed system optimizations based on performance models with limited scope. Specifically, such models do not capture the complex interactions between hardware properties and system execution mechanisms. Therefore, previous approaches neither identify nor achieve the hardware limit. This paper presents MoE-Lens, a high-throughput MoE LLM inference system designed through holistic performance modeling for resource-constrained environments. Our performance model thoroughly analyzes various fundamental system components, including CPU memory capacity, GPU compute power, and workload characteristics, to understand the theoretical performance upper bound of MoE inference. Furthermore, it captures the system execution mechanisms to identify the key hardware bottlenecks and accurately predict the achievable throughput. Informed by our performance model, MoE-Lens introduces an inference system approaching hardware limits. Evaluated on diverse MoE models and datasets, MoE-Lens outperforms the state-of-the-art solution by 4.6x on average (up to 25.5x), with our theoretical model predicting performance with an average 94% accuracy.
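The abstract describes modeling per-device costs (CPU memory capacity/bandwidth for attention, GPU compute for expert GEMMs) and using the model to find the bottleneck that bounds pipelined throughput. The sketch below is an illustrative roofline-style estimator in that spirit, not the paper's actual model; all function names, parameters, and constants are hypothetical assumptions.

```python
# Illustrative sketch (NOT the paper's model): a roofline-style estimator
# for CPU-GPU hybrid MoE decoding. Assumes the CPU attention stage is
# memory-bandwidth bound and the GPU expert stage is compute bound, with
# perfect cross-device pipelining. All numbers below are made up.

def stage_times(batch_tokens, cpu_bw_gbs, gpu_tflops,
                attn_bytes_per_tok, expert_flops_per_tok):
    """Per-batch time (seconds) of each pipeline stage."""
    # CPU attention streams the KV cache from host memory,
    # so its time is bytes moved / memory bandwidth.
    t_cpu = batch_tokens * attn_bytes_per_tok / (cpu_bw_gbs * 1e9)
    # GPU expert GEMMs run at (assumed) peak compute throughput,
    # so their time is FLOPs / peak FLOP rate.
    t_gpu = batch_tokens * expert_flops_per_tok / (gpu_tflops * 1e12)
    return {"cpu_attention": t_cpu, "gpu_experts": t_gpu}

def predict(batch_tokens, **hw):
    """Return (bottleneck stage, steady-state tokens/sec)."""
    times = stage_times(batch_tokens, **hw)
    # Under perfect pipelining, steady-state throughput is limited
    # by the slowest stage alone.
    bottleneck = max(times, key=times.get)
    return bottleneck, batch_tokens / times[bottleneck]

# Hypothetical hardware: 100 GB/s host bandwidth, 100 TFLOP/s GPU,
# 2 MB of KV-cache traffic and 1 GFLOP of expert compute per token.
stage, tput = predict(1024, cpu_bw_gbs=100, gpu_tflops=100,
                      attn_bytes_per_tok=2e6, expert_flops_per_tok=1e9)
```

With these made-up numbers the CPU attention stage dominates, which mirrors the paper's point: the limiting resource shifts with hardware ratios and workload characteristics, so an accurate model must capture both to predict the achievable throughput.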
Problem

Research questions and friction points this paper is trying to address.

- Optimize CPU-GPU resource use for MoE LLM inference
- Achieve hardware performance limits in resource-constrained environments
- Model system interactions to predict throughput accurately
Innovation

Methods, ideas, or system contributions that make the work stand out.

- Holistic performance modeling for MoE LLM inference
- Optimizes CPU-GPU hybrid execution efficiently
- Achieves near-hardware-limit throughput performance
Yichao Yuan — University of Michigan, Ann Arbor, Michigan, USA
Lin Ma — University of Michigan, Ann Arbor, Michigan, USA
Nishil Talati — Assistant Research Scientist, University of Michigan (Computer Architecture, Systems, Generative AI, Data Analytics)