🤖 AI Summary
MoE models face severe deployment bottlenecks on commodity hardware due to GPU memory constraints and PCIe transfer latency that vastly exceeds computation time. To address this, we propose PreScope, a prediction-driven expert scheduling system with three core components: (1) LLaPor, a learnable, layer-aware expert access predictor that accurately models cross-layer expert invocation patterns; (2) PreSched, a prefetch-aware, cross-layer scheduler that generates globally optimal expert weight loading sequences; and (3) AsyncIO, an asynchronous I/O optimizer that decouples data loading from computation. Together, these components enable low-overhead, dynamic expert weight loading during inference. Experiments show that PreScope achieves 141% higher throughput and 74.6% lower end-to-end latency than state-of-the-art baselines, significantly improving MoE inference efficiency in resource-constrained environments.
📝 Abstract
Mixture-of-Experts (MoE) models face memory and PCIe latency bottlenecks when deployed on commodity hardware: offloading expert weights to CPU memory incurs PCIe transfer latency that exceeds GPU computation time severalfold. We present PreScope, a prediction-driven expert scheduling system that addresses three key challenges: inaccurate activation prediction, PCIe bandwidth contention, and cross-device scheduling complexity. Our solution includes: 1) a Learnable Layer-Aware Predictor (LLaPor) that captures layer-specific expert activation patterns; 2) Prefetch-Aware Cross-Layer Scheduling (PreSched) that generates globally optimal plans balancing prefetching costs against on-demand loading overhead; 3) an Asynchronous I/O Optimizer (AsyncIO) that decouples I/O from computation, eliminating waiting bubbles. PreScope achieves 141% higher throughput and 74.6% lower latency than state-of-the-art solutions.
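To make the core idea concrete, here is a toy Python sketch of prediction-driven prefetching, not the authors' implementation: the predictor, expert IDs, and timings are all invented stand-ins. While layer *i* computes, a background worker loads the experts predicted for layer *i+1*, so the (simulated) PCIe transfer overlaps compute instead of stalling it; any expert the predictor missed falls back to a blocking on-demand load.

```python
import threading
import time

PCIE_LOAD_S = 0.02   # simulated per-expert PCIe transfer time (invented)
COMPUTE_S   = 0.03   # simulated per-layer compute time (invented)

def predict_next_experts(layer):
    # Stand-in for an LLaPor-style predictor: here it happens to be
    # perfectly accurate, returning the experts layer+1 will activate.
    return {(layer + 1) * 10, (layer + 1) * 10 + 1}

class ExpertCache:
    """Tracks which expert weights are resident on the 'GPU'."""
    def __init__(self):
        self.loaded = set()
        self.lock = threading.Lock()

    def load(self, experts):
        for e in experts:
            time.sleep(PCIE_LOAD_S)       # simulated PCIe transfer
            with self.lock:
                self.loaded.add(e)

    def ensure(self, experts):
        # On-demand (blocking) load of anything the prefetcher missed.
        with self.lock:
            missing = experts - self.loaded
        self.load(missing)

def run_layers(n_layers, prefetch):
    """Returns how many layers stalled waiting for expert weights."""
    cache = ExpertCache()
    cache.load(predict_next_experts(-1))  # warm up: experts for layer 0
    stalls = 0
    for layer in range(n_layers):
        worker = None
        if prefetch:
            # Overlap: load next layer's experts while this layer computes.
            worker = threading.Thread(
                target=cache.load, args=(predict_next_experts(layer),))
            worker.start()
        needed = {layer * 10, layer * 10 + 1}
        with cache.lock:
            hit = needed <= cache.loaded
        if not hit:
            stalls += 1                   # compute must wait for PCIe
        cache.ensure(needed)
        time.sleep(COMPUTE_S)             # simulated expert computation
        if worker:
            worker.join()
    return stalls
```

With prefetching enabled, every layer after warm-up finds its experts already resident (`run_layers(4, prefetch=True)` reports 0 stalls); without it, every layer past the first blocks on a synchronous load. A real system would use pinned host memory and asynchronous device copies on a separate CUDA stream rather than Python threads, and would also have to handle predictor misses and bandwidth contention, which this sketch ignores.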