🤖 AI Summary
All-to-all communication in expert parallelism (EP) imposes severe overhead during inference for large Mixture-of-Experts (MoE) models. Method: We propose a prediction-driven parallel optimization framework that systematically integrates speculative parallelization into MoE inference, introducing speculative token shuffling and speculative expert grouping to losslessly compress EP communication volume. Our approach jointly leverages routing-path prediction, dynamic pre-construction of the expert topology, and asynchronous pre-scheduled execution, and is deeply integrated with DeepSpeed-MoE and SGLang. Contribution/Results: Experiments on both homogeneous and heterogeneous networks demonstrate up to a 72% reduction in EP communication volume, an average 31% decrease in end-to-end latency, and substantial improvements in throughput and latency-constrained inference efficiency, all achieved with zero precision loss.
📝 Abstract
MoE (Mixture of Experts) has become a prevailing neural architecture for scaling modern transformer-based LLMs (Large Language Models) to unprecedented sizes. Nevertheless, large MoE models' heavy demands on computing power, memory capacity, and memory bandwidth make scalable serving a fundamental challenge, and efficient parallel inference has become a prerequisite for attaining adequate throughput under latency constraints. DeepSpeed-MoE, a state-of-the-art MoE inference framework, adopts a 3D-parallel paradigm comprising EP (Expert Parallelism), TP (Tensor Parallelism), and DP (Data Parallelism). However, our analysis shows that DeepSpeed-MoE's inference efficiency is largely bottlenecked by EP, which is implemented with costly all-to-all collectives to route token activations. Our work aims to boost DeepSpeed-MoE by strategically reducing EP's communication overhead with a technique named Speculative MoE. Speculative MoE comprises two speculative parallelization schemes, speculative token shuffling and speculative expert grouping, which predict outstanding tokens' expert routing paths and pre-schedule tokens and experts across devices to losslessly trim EP's communication volume. Besides DeepSpeed-MoE, we also build Speculative MoE into SGLang, another prevailing MoE inference engine. Experiments show that Speculative MoE can significantly boost state-of-the-art MoE inference frameworks on both fast homogeneous and slow heterogeneous interconnects.
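
To make the intuition concrete, below is a minimal, hedged sketch of the idea behind speculative token shuffling: if each token's expert routing can be predicted before the gate runs, the token can be pre-placed on the device that hosts its predicted expert, so the subsequent all-to-all only has to move the mispredicted tokens. This is an illustrative toy model, not the paper's actual algorithm; the predictor `predict_routing`, the static expert placement, and the 80% prediction-accuracy figure are assumptions made purely for the example.

```python
# Toy illustration of speculative token shuffling for expert-parallel MoE inference.
# Assumptions (not from the paper): a hypothetical routing predictor, a static
# round-robin expert-to-device placement, and a simulated 80% prediction accuracy.
import numpy as np

rng = np.random.default_rng(0)

NUM_TOKENS, NUM_EXPERTS, NUM_DEVICES = 4096, 64, 8
expert_to_device = np.arange(NUM_EXPERTS) % NUM_DEVICES  # static expert placement

def predict_routing(num_tokens: int) -> np.ndarray:
    """Hypothetical lightweight predictor of each token's top-1 expert."""
    return rng.integers(0, NUM_EXPERTS, size=num_tokens)

def true_routing(predicted: np.ndarray, accuracy: float = 0.8) -> np.ndarray:
    """Simulate the gate's actual decision, agreeing with the prediction `accuracy` of the time."""
    actual = predicted.copy()
    miss = rng.random(predicted.size) > accuracy
    actual[miss] = rng.integers(0, NUM_EXPERTS, size=miss.sum())
    return actual

predicted = predict_routing(NUM_TOKENS)
actual = true_routing(predicted)

# Speculative token shuffling: pre-place each token on the device hosting its
# predicted expert before the gate runs.
speculative_device = expert_to_device[predicted]
required_device = expert_to_device[actual]

# Baseline EP: tokens start wherever the batch was sharded (here, round-robin),
# so nearly every token crosses the network in the all-to-all.
baseline_device = np.arange(NUM_TOKENS) % NUM_DEVICES
baseline_volume = np.count_nonzero(baseline_device != required_device)

# With speculation, only tokens whose predicted expert sits on the wrong device move.
speculative_volume = np.count_nonzero(speculative_device != required_device)

print(f"baseline all-to-all tokens moved:    {baseline_volume}")
print(f"speculative all-to-all tokens moved: {speculative_volume}")
print(f"communication volume reduced by {1 - speculative_volume / baseline_volume:.0%}")
```

Speculative expert grouping follows a complementary intuition: experts that are frequently co-activated are co-located on the same device so that a token's whole top-k routing stays local more often; the sketch above only models the token-shuffling side.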