🤖 AI Summary
Large-scale Mixture-of-Experts (MoE) models face three core challenges during inference on resource-constrained platforms: high expert offloading overhead, complex CPU–GPU co-scheduling, and highly unstable expert activation patterns. To address these, this work proposes (1) dynamic intra-layer scheduling, (2) influence-driven cross-layer prefetching, and (3) a scoring-based heterogeneous caching mechanism—achieving, for the first time, adaptive co-optimization for irregular expert distributions and transient activation patterns. Built upon the kTransformers framework, our approach integrates dynamic load balancing, multi-level cache management, and expert-aware prefetching to enable efficient heterogeneous hardware collaboration. Evaluated on three mainstream MoE models, it delivers average speedups of 1.33× during prefill and 1.70× during decode—outperforming existing hybrid inference solutions.
📝 Abstract
The Mixture of Experts (MoE) architecture has demonstrated significant advantages, as it increases model capacity without a proportional increase in computation. However, the large size of MoE models still imposes substantial memory demands, which usually requires expert offloading on resource-constrained platforms and incurs significant overhead. Hybrid CPU-GPU inference has been proposed to leverage CPU computation and reduce expert loading overhead, but it faces major challenges: on one hand, the expert activation patterns of MoE models are highly unstable, rendering the fixed mapping strategies in existing works inefficient; on the other hand, hybrid CPU-GPU scheduling for MoE is inherently complex due to diverse expert sizes and structures, uneven workload distribution, and so on. To address these challenges, in this paper we propose HybriMoE, a hybrid CPU-GPU inference framework that improves resource utilization through a novel CPU-GPU scheduling and cache management system. HybriMoE introduces (i) a dynamic intra-layer scheduling strategy to balance workloads across CPU and GPU, (ii) an impact-driven inter-layer prefetching algorithm, and (iii) a score-based caching algorithm to mitigate expert activation instability. We implement HybriMoE on top of the kTransformers framework and evaluate it on three widely used MoE-based LLMs. Experimental results demonstrate that HybriMoE achieves an average speedup of 1.33$\times$ in the prefill stage and 1.70$\times$ in the decode stage compared to state-of-the-art hybrid MoE inference frameworks. Our code is available at: https://github.com/PKU-SEC-Lab/HybriMoE.
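To make the score-based caching idea concrete, here is a minimal sketch of one plausible policy. This is an illustration only, not HybriMoE's actual algorithm: we assume each expert carries a score that decays every decode step and is boosted when the expert is activated, and that the lowest-scoring resident expert is evicted when the GPU cache is full. The class name `ScoreBasedExpertCache` and all parameters (`decay`, `boost`) are hypothetical.

```python
from collections import defaultdict


class ScoreBasedExpertCache:
    """Toy score-based cache for MoE experts (hypothetical sketch).

    Assumed policy: scores decay each step, activated experts get a
    boost, and the lowest-scoring cached expert is evicted on overflow.
    """

    def __init__(self, capacity, decay=0.9, boost=1.0):
        self.capacity = capacity
        self.decay = decay
        self.boost = boost
        self.scores = defaultdict(float)  # expert id -> activation score
        self.cached = set()               # expert ids resident on GPU

    def step(self, activated):
        """Update scores for one decode step given the activated experts."""
        for e in self.scores:
            self.scores[e] *= self.decay
        for e in activated:
            self.scores[e] += self.boost
            if e not in self.cached:
                self._admit(e)

    def _admit(self, expert):
        if len(self.cached) >= self.capacity:
            # Evict the resident expert with the lowest score.
            victim = min(self.cached, key=lambda e: self.scores[e])
            self.cached.remove(victim)
        self.cached.add(expert)


# Example: capacity 2; expert 1 is activated once and goes cold,
# so it is the one evicted when expert 2 arrives.
cache = ScoreBasedExpertCache(capacity=2)
for step in ([0], [1], [0], [2]):
    cache.step(step)
print(sorted(cache.cached))
```

The decay factor is what distinguishes this from plain LFU: a recently hot expert outranks one with many stale activations, which matches the abstract's point that activation patterns are transient.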