MoE-SpeQ: Speculative Quantized Decoding with Proactive Expert Prefetching and Offloading for Mixture-of-Experts

📅 2025-11-17
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
MoE inference suffers from PCIe I/O bottlenecks due to dynamic expert selection, severely limiting the feasibility of offloading experts to host memory. To address this, we propose the first system that jointly leverages speculative execution and expert offloading. Our approach introduces a lightweight draft model to predict the sequence of activated experts, a proactive runtime scheduler, and an adaptive controller guided by an amortized Roofline model, enabling expert prefetching, computation-I/O overlap, and dynamic tuning of speculation policies. By relocating I/O latency off the critical path, our method systematically optimizes data-dependent memory accesses. Evaluated on Phi-MoE, it achieves up to a 2.34x speedup over the best prior offloading framework, significantly improving MoE inference efficiency and deployment viability on resource-constrained devices.

๐Ÿ“ Abstract
The immense memory requirements of state-of-the-art Mixture-of-Experts (MoE) models present a significant challenge for inference, often exceeding the capacity of a single accelerator. While offloading experts to host memory is a common solution, it introduces a severe I/O bottleneck over the PCIe bus, as the data-dependent nature of expert selection places these synchronous transfers directly on the critical path of execution, crippling performance. This paper argues that the I/O bottleneck can be overcome by trading a small amount of cheap, on-device computation to hide the immense cost of data movement. We present MoE-SpeQ, a new inference system built on a novel co-design of speculative execution and expert offloading. MoE-SpeQ employs a small, on-device draft model to predict the sequence of required experts for future tokens. This foresight enables a runtime orchestrator to prefetch these experts from host memory, effectively overlapping the expensive I/O with useful computation and hiding the latency from the critical path. To maximize performance, an adaptive governor, guided by an Amortization Roofline Model, dynamically tunes the speculation strategy to the underlying hardware. Our evaluation on memory-constrained devices shows that for the Phi-MoE model, MoE-SpeQ achieves up to a 2.34x speedup over the state-of-the-art offloading framework. Our work establishes a new, principled approach for managing data-dependent memory access in resource-limited environments, making MoE inference more accessible on commodity hardware.
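The abstract's Amortization Roofline Model is not spelled out on this page, but the underlying trade-off it governs can be illustrated with a back-of-the-envelope break-even calculation: speculation pays off once the compute time of the speculated tokens covers the PCIe transfer time of the experts they require. A minimal sketch, where the function names and the example numbers are assumptions for illustration only:

```python
import math

# Illustrative break-even analysis for speculative expert prefetching.
# This is NOT the paper's Amortization Roofline Model, only a sketch of the
# kind of compute-vs-I/O trade-off such a model balances.

def transfer_time_s(expert_bytes: float, pcie_gbps: float) -> float:
    """Time to move one expert's weights over PCIe, in seconds."""
    return expert_bytes / (pcie_gbps * 1e9)

def min_speculation_depth(expert_bytes: float,
                          pcie_gbps: float,
                          token_compute_s: float,
                          experts_per_token: int) -> int:
    """Smallest number of speculated tokens whose compute time fully
    covers the I/O time of the experts one future token needs."""
    io_s = experts_per_token * transfer_time_s(expert_bytes, pcie_gbps)
    return max(1, math.ceil(io_s / token_compute_s))

# Assumed example: 200 MB per expert, 16 GB/s effective PCIe bandwidth,
# 10 ms of compute per token, top-2 expert routing.
depth = min_speculation_depth(200e6, 16.0, 0.010, 2)
```

With these assumed numbers, fetching a future token's two experts costs 25 ms of I/O, so at least three tokens' worth of compute must be overlapped to hide it; an adaptive governor would retune this depth as hardware bandwidth or model size changes.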
Problem

Research questions and friction points this paper is trying to address.

Reducing I/O bottlenecks in MoE inference caused by expert offloading
Overlapping expert prefetching with computation to hide latency
Optimizing speculation strategy for memory-constrained hardware performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses speculative execution with draft model for prediction
Prefetches experts proactively to overlap I/O with computation
Employs adaptive governor guided by Amortization Roofline Model
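The first two points can be sketched as a decode loop that overlaps prefetch I/O with token compute. Below is a minimal Python sketch in which a background thread stands in for an asynchronous PCIe copy; `run_decode` and its callables are illustrative names under assumed semantics, not the MoE-SpeQ API:

```python
# Sketch of speculative expert prefetching: while the current token is being
# computed, a background thread (standing in for an async host->device copy)
# fetches the experts the draft model predicts for the *next* token.
# All names here are illustrative assumptions, not the MoE-SpeQ API.
import threading

def run_decode(draft_predict, fetch_expert, compute_token, num_tokens):
    cache = {}                                   # experts resident on the device
    for eid in draft_predict(0):                 # first token: nothing to overlap yet
        cache[eid] = fetch_expert(eid)

    def prefetch(expert_ids):
        for eid in expert_ids:
            if eid not in cache:
                cache[eid] = fetch_expert(eid)   # stands in for a host->device DMA

    outputs = []
    for t in range(num_tokens):
        worker = threading.Thread(target=prefetch, args=(draft_predict(t + 1),))
        worker.start()                           # I/O runs while we compute below
        outputs.append(compute_token(t, cache))  # current token's experts are cached
        worker.join()                            # next step waits for the prefetch
    return outputs

# Toy usage: a "perfect" draft model and string stand-ins for expert weights.
route = lambda t: [t % 4, (t + 1) % 4]
toks = run_decode(route, lambda e: f"w{e}", lambda t, c: [c[e] for e in route(t)], 3)
```

In a real system the thread would be a CUDA stream issuing copies from pinned host memory, and a draft misprediction would fall back to a synchronous fetch on the critical path, which is why prediction accuracy and speculation depth both matter.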
Wenfeng Wang
Shanghai Jiao Tong University, Shanghai, China
Jiacheng Liu
Hong Kong University of Science and Technology, Hong Kong, China
Xiaofeng Hou
Shanghai Jiao Tong University, Shanghai, China
Xinfeng Xia
Shanghai Jiao Tong University, China
Peng Tang
Meta
Multi-modal LLM · Vision Language · Computer Vision
Mingxuan Zhang
Shanghai Jiao Tong University, Shanghai, China
Chao Li
Shanghai Jiao Tong University, Shanghai, China
Minyi Guo
IEEE Fellow, Chair Professor, Shanghai Jiao Tong University
Parallel Computing · Compiler Optimization · Cloud Computing · Networking · Big Data