🤖 AI Summary
MoE models face severe deployment bottlenecks on commodity hardware due to GPU memory constraints and PCIe transfer latency that vastly exceeds computation time. To address this, we propose PreScope, a prediction-driven expert scheduling system with three core components: (1) LLaPor, a learnable, layer-aware expert access predictor that accurately models cross-layer expert invocation patterns; (2) PreSched, a prefetch-aware, cross-layer scheduler that generates globally optimal expert weight loading sequences; and (3) AsyncIO, an asynchronous I/O optimizer that decouples data loading from computation. Together, these components enable low-overhead, dynamic expert weight loading during inference. Experiments show that PreScope achieves 141% higher throughput and 74.6% lower end-to-end latency than state-of-the-art baselines, significantly improving MoE inference efficiency in resource-constrained environments.
📝 Abstract
Mixture-of-Experts (MoE) models face memory and PCIe latency bottlenecks when deployed on commodity hardware: offloading expert weights to CPU memory incurs PCIe transfer latency that exceeds GPU computation time severalfold. We present PreScope, a prediction-driven expert scheduling system that addresses three key challenges: inaccurate activation prediction, PCIe bandwidth contention, and cross-device scheduling complexity. Our solution includes: 1) a Learnable Layer-Aware Predictor (LLaPor) that captures layer-specific expert activation patterns; 2) Prefetch-Aware Cross-Layer Scheduling (PreSched) that generates globally optimal plans balancing prefetching costs against on-demand loading overhead; 3) an Asynchronous I/O Optimizer (AsyncIO) that decouples I/O from computation, eliminating waiting bubbles. PreScope achieves 141% higher throughput and 74.6% lower latency than state-of-the-art solutions.
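To make the core idea concrete, here is a toy Python sketch of prediction-driven prefetching, not the authors' implementation: the predictor, expert IDs, and timings are all invented stand-ins. While layer *i* computes, a background worker loads the experts predicted for layer *i+1*, so the (simulated) PCIe transfer overlaps compute instead of stalling it; any expert the predictor missed falls back to a blocking on-demand load.

```python
import threading
import time

PCIE_LOAD_S = 0.02   # simulated per-expert PCIe transfer time (invented)
COMPUTE_S   = 0.03   # simulated per-layer compute time (invented)

def predict_next_experts(layer):
    # Stand-in for an LLaPor-style predictor: here it happens to be
    # perfectly accurate, returning the experts layer+1 will activate.
    return {(layer + 1) * 10, (layer + 1) * 10 + 1}

class ExpertCache:
    """Tracks which expert weights are resident on the 'GPU'."""
    def __init__(self):
        self.loaded = set()
        self.lock = threading.Lock()

    def load(self, experts):
        for e in experts:
            time.sleep(PCIE_LOAD_S)       # simulated PCIe transfer
            with self.lock:
                self.loaded.add(e)

    def ensure(self, experts):
        # On-demand (blocking) load of anything the prefetcher missed.
        with self.lock:
            missing = experts - self.loaded
        self.load(missing)

def run_layers(n_layers, prefetch):
    """Returns how many layers stalled waiting for expert weights."""
    cache = ExpertCache()
    cache.load(predict_next_experts(-1))  # warm up: experts for layer 0
    stalls = 0
    for layer in range(n_layers):
        worker = None
        if prefetch:
            # Overlap: load next layer's experts while this layer computes.
            worker = threading.Thread(
                target=cache.load, args=(predict_next_experts(layer),))
            worker.start()
        needed = {layer * 10, layer * 10 + 1}
        with cache.lock:
            hit = needed <= cache.loaded
        if not hit:
            stalls += 1                   # compute must wait for PCIe
        cache.ensure(needed)
        time.sleep(COMPUTE_S)             # simulated expert computation
        if worker:
            worker.join()
    return stalls
```

With prefetching enabled, every layer after warm-up finds its experts already resident (`run_layers(4, prefetch=True)` reports 0 stalls); without it, every layer past the first blocks on a synchronous load. A real system would use pinned host memory and asynchronous device copies on a separate CUDA stream rather than Python threads, and would also have to handle predictor misses and bandwidth contention, which this sketch ignores.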