HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference

📅 2025-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large-scale Mixture-of-Experts (MoE) models face three core challenges during inference on resource-constrained platforms: high expert offloading overhead, complex CPU-GPU co-scheduling, and highly unstable expert activation patterns. To address these, this work proposes (1) dynamic intra-layer scheduling, (2) impact-driven inter-layer prefetching, and (3) a score-based caching mechanism, achieving for the first time adaptive co-optimization for irregular expert distributions and transient activation patterns. Built on top of the kTransformers framework, the approach combines dynamic load balancing, multi-level cache management, and expert-aware prefetching to enable efficient heterogeneous hardware collaboration. Evaluated on three mainstream MoE models, it delivers average speedups of 1.33× in the prefill stage and 1.70× in the decode stage, outperforming existing hybrid inference solutions.

📝 Abstract
The Mixture of Experts (MoE) architecture has demonstrated significant advantages, as it increases model capacity without a proportional increase in computation. However, the large size of MoE models still introduces substantial memory demands, which usually requires expert offloading on resource-constrained platforms and incurs significant overhead. Hybrid CPU-GPU inference has been proposed to leverage CPU computation and reduce expert loading overhead, but it faces major challenges: on one hand, the expert activation patterns of MoE models are highly unstable, rendering the fixed mapping strategies of existing works inefficient; on the other hand, hybrid CPU-GPU scheduling for MoE is inherently complex due to diverse expert sizes and structures, uneven workload distribution, and so on. To address these challenges, in this paper we propose HybriMoE, a hybrid CPU-GPU inference framework that improves resource utilization through a novel CPU-GPU scheduling and cache management system. HybriMoE introduces (i) a dynamic intra-layer scheduling strategy to balance workloads across CPU and GPU, (ii) an impact-driven inter-layer prefetching algorithm, and (iii) a score-based caching algorithm to mitigate expert activation instability. We implement HybriMoE on top of the kTransformers framework and evaluate it on three widely used MoE-based LLMs. Experimental results demonstrate that HybriMoE achieves an average speedup of 1.33× in the prefill stage and 1.70× in the decode stage compared to state-of-the-art hybrid MoE inference frameworks. Our code is available at: https://github.com/PKU-SEC-Lab/HybriMoE.
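The score-based caching idea from the abstract can be illustrated with a minimal sketch. This is an illustrative reconstruction under simple assumptions, not the authors' implementation: each expert accumulates a score from recent router activations that decays over time, and on a cache miss with a full cache, the lowest-scoring cached expert is evicted. The class and parameter names below are hypothetical.

```python
# Illustrative sketch of score-based expert caching (hypothetical names,
# not the HybriMoE code): hot experts stay resident on GPU, cold ones
# are evicted when new experts must be loaded.

class ScoreBasedExpertCache:
    def __init__(self, capacity, decay=0.9):
        self.capacity = capacity   # max number of experts resident on GPU
        self.decay = decay         # per-access score decay factor
        self.scores = {}           # expert_id -> decayed activation score
        self.cached = set()        # expert_ids currently resident

    def access(self, expert_id, router_weight=1.0):
        """Record an activation; return True on cache hit, False on miss."""
        # Decay all scores so experts that stopped being activated
        # gradually lose caching priority.
        for eid in self.scores:
            self.scores[eid] *= self.decay
        # Credit this expert with its router weight for the current token.
        self.scores[expert_id] = self.scores.get(expert_id, 0.0) + router_weight

        if expert_id in self.cached:
            return True
        # Miss: evict the lowest-scoring resident expert if the cache is full.
        if len(self.cached) >= self.capacity:
            victim = min(self.cached, key=lambda eid: self.scores.get(eid, 0.0))
            self.cached.remove(victim)
        self.cached.add(expert_id)
        return False
```

Unlike plain LRU, the score keeps a frequently activated expert resident even if it skipped the most recent token, which is the kind of behavior a caching policy needs under unstable activation patterns.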
Problem

Research questions and friction points this paper is trying to address.

Optimize CPU-GPU scheduling for efficient MoE inference
Manage expert activation instability in hybrid computing
Reduce memory overhead in large MoE models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic intra-layer scheduling balances CPU-GPU workloads
Impact-driven inter-layer prefetching enhances efficiency
Score-based caching mitigates expert activation instability
Authors
- Shuzhang Zhong: Peking University, Machine Learning System
- Yanfan Sun: School of Computer Science and Engineering, Beihang University, Beijing, China
- Ling Liang: Peking University (pku.edu.cn)
- Runsheng Wang: School of Integrated Circuits, Peking University, Beijing, China; Institute of Electronic Design Automation, Peking University, Wuxi, China; Beijing Advanced Innovation Center for Integrated Circuits, Beijing, China
- Ru Huang: School of Integrated Circuits, Peking University, Beijing, China; Institute of Electronic Design Automation, Peking University, Wuxi, China; Beijing Advanced Innovation Center for Integrated Circuits, Beijing, China
- Meng Li: Institute for Artificial Intelligence, Peking University, Beijing, China; School of Integrated Circuits, Peking University, Beijing, China; Beijing Advanced Innovation Center for Integrated Circuits, Beijing, China