SP-MoE: Speculative Decoding and Prefetching for Accelerating MoE-based Model Inference

📅 2025-10-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address GPU memory bloat and exacerbated CPU-GPU bandwidth contention when integrating Mixture-of-Experts (MoE) models with speculative decoding (SD), this work proposes the first SD-aware expert-offloading and compute-communication pipelining architecture. The method introduces three key innovations: (1) a speculative expert prefetching mechanism that proactively loads experts likely to be activated; (2) a cutoff-layer control strategy that bounds per-layer prefetch depth to balance verification overhead against throughput gain; and (3) asynchronous, batched I/O pipelining that decouples expert loading from computation. Guided by an analytical latency model for resource co-optimization, the design significantly alleviates memory and bandwidth bottlenecks. Extensive evaluation across multiple MoE models, datasets, and hardware configurations demonstrates end-to-end time-per-output-token (TPOT) speedups of 1.07×-3.5× over state-of-the-art MoE offloading and SD optimization approaches.

📝 Abstract
The Mixture-of-Experts (MoE) architecture has been widely adopted in large language models (LLMs) to reduce computation cost through model sparsity. Employing speculative decoding (SD) can further accelerate MoE inference by drafting multiple tokens per step and verifying them in parallel. However, combining MoE with SD inflates GPU memory and aggravates CPU-GPU bandwidth contention during multi-token verification. Existing MoE offloading systems are SD-agnostic and do not address this bottleneck. We present SP-MoE, the first SD-aware expert-offloading and compute-communication pipelining framework. SP-MoE introduces: (1) speculative expert prefetching that exploits structural correspondence between the draft and target models to prefetch likely experts ahead of verification; (2) a cutoff-layer policy that bounds per-layer prefetch depth based on empirical profiles and an analytical latency model, guaranteeing just-in-time availability without overfetch; and (3) a pipelined runtime with asynchronous prefetch threads and batched I/O to hide loading latency. Extensive experiments demonstrate that SP-MoE achieves a 1.07×-3.5× TPOT speedup over state-of-the-art methods across diverse datasets, environments, and MoE-based models.
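The just-in-time guarantee behind a cutoff-layer policy can be sketched as a simple feasibility condition: a prefetch issued d layers ahead of where the expert is needed is fully hidden only if d layers of compute cover the expert transfer time. A minimal illustration of that idea (all numbers and function names are hypothetical; the paper's actual latency model is more detailed):

```python
# Hypothetical sketch of a just-in-time condition for expert prefetching.
# Timings are illustrative and not taken from the paper.

def min_lookahead(layer_compute_ms, expert_load_ms):
    """Smallest lookahead d such that a prefetch issued d layers early
    always completes before compute reaches the target layer, i.e. the
    compute time of any d consecutive layers covers the transfer time."""
    n = len(layer_compute_ms)
    for d in range(1, n + 1):
        if all(sum(layer_compute_ms[i:i + d]) >= expert_load_ms
               for i in range(n - d + 1)):
            return d
    return None  # transfer can never be hidden; must stall

compute = [2.0] * 24   # per-layer compute time (ms), illustrative
load = 5.5             # time to move one expert CPU->GPU (ms), illustrative
print(min_lookahead(compute, load))  # → 3, since 3 * 2.0 ms >= 5.5 ms
```

A cutoff-layer policy would then decline to prefetch for layers closer than this lookahead, since those transfers could no longer be hidden behind compute.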
Problem

Research questions and friction points this paper is trying to address.

Accelerates MoE inference with speculative decoding
Reduces GPU memory inflation during multi-token verification
Minimizes CPU-GPU bandwidth contention via expert prefetching
Innovation

Methods, ideas, or system contributions that make the work stand out.

Speculative expert prefetching using draft-target model correspondence
Cutoff-layer policy bounding prefetch depth with latency model
Pipelined runtime with asynchronous prefetch and batched I/O
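The pipelining idea in the last point can be sketched with a background thread that drains batched expert-load requests while the main loop computes, so transfer latency overlaps with computation. This is a toy illustration under assumed names (`prefetch_worker`, simulated loads); the actual runtime performs real CPU-to-GPU copies:

```python
import threading
import queue

# Hypothetical sketch of a pipelined runtime: a background thread services
# batched expert-load requests while the main thread would run layer compute.

def prefetch_worker(requests, cache, done):
    while not done.is_set() or not requests.empty():
        try:
            batch = requests.get(timeout=0.01)
        except queue.Empty:
            continue
        for expert_id in batch:  # batched I/O: one dequeue, many loads
            cache[expert_id] = f"weights-{expert_id}"  # simulated transfer
        requests.task_done()

cache = {}
requests = queue.Queue()
done = threading.Event()
worker = threading.Thread(target=prefetch_worker, args=(requests, cache, done))
worker.start()

# Main loop: enqueue experts predicted for layer l+1, then compute layer l,
# so the transfer for the next layer overlaps with the current layer's work.
for layer in range(4):
    predicted = [(layer + 1, e) for e in (0, 3)]  # draft-guided prediction
    requests.put(predicted)
    # ... compute_layer(layer) would run here, overlapping the transfer ...

requests.join()
done.set()
worker.join()
print(len(cache))  # → 8 prefetched experts (2 per layer over 4 layers)
```

The queue gives the decoupling described above: the compute thread never blocks on I/O, and grouping several expert requests per queue item amortizes per-transfer overhead.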