🤖 AI Summary
All-to-all communication in expert parallelism (EP) imposes severe overhead during inference for large Mixture-of-Experts (MoE) models. Method: We propose a prediction-driven parallel optimization framework that systematically integrates speculative parallelization into MoE inference, introducing speculative token shuffling and speculative expert grouping to losslessly compress EP communication volume. Our approach jointly leverages routing-path prediction, dynamic pre-construction of the expert topology, and asynchronous pre-scheduled execution, and is deeply integrated with DeepSpeed-MoE and SGLang. Contribution/Results: Experiments on both homogeneous and heterogeneous networks demonstrate up to a 72% reduction in EP communication volume, an average 31% decrease in end-to-end latency, and substantial improvements in throughput and latency-constrained inference efficiency, all achieved with zero precision loss.
📝 Abstract
MoE (Mixture of Experts) has become a prevailing neural architecture for scaling modern transformer-based LLMs (Large Language Models) to unprecedented sizes. Nevertheless, large MoE models' heavy demands on computing power, memory capacity, and memory bandwidth make scalable serving a fundamental challenge, and efficient parallel inference has become a prerequisite for attaining adequate throughput under latency constraints. DeepSpeed-MoE, a state-of-the-art MoE inference framework, adopts a 3D-parallel paradigm comprising EP (Expert Parallelism), TP (Tensor Parallelism), and DP (Data Parallelism). However, our analysis shows that DeepSpeed-MoE's inference efficiency is largely bottlenecked by EP, which is implemented with costly all-to-all collectives to route token activations. Our work aims to boost DeepSpeed-MoE by strategically reducing EP's communication overhead with a technique named Speculative MoE. Speculative MoE comprises two speculative parallelization schemes, speculative token shuffling and speculative expert grouping, which predict outstanding tokens' expert routing paths and pre-schedule tokens and experts across devices to losslessly trim EP's communication volume. Besides DeepSpeed-MoE, we also build Speculative MoE into SGLang, another prevailing MoE inference engine. Experiments show that Speculative MoE can significantly boost state-of-the-art MoE inference frameworks on both fast homogeneous and slow heterogeneous interconnects.
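
To make the intuition concrete, below is a minimal, hedged sketch of the idea behind speculative token shuffling: if each token's expert routing can be predicted before the gate runs, the token can be pre-placed on the device that hosts its predicted expert, so the subsequent all-to-all only has to move the mispredicted tokens. This is an illustrative toy model, not the paper's actual algorithm; the predictor `predict_routing`, the static expert placement, and the 80% prediction-accuracy figure are assumptions made purely for the example.

```python
# Toy illustration of speculative token shuffling for expert-parallel MoE inference.
# Assumptions (not from the paper): a hypothetical routing predictor, a static
# round-robin expert-to-device placement, and a simulated 80% prediction accuracy.
import numpy as np

rng = np.random.default_rng(0)

NUM_TOKENS, NUM_EXPERTS, NUM_DEVICES = 4096, 64, 8
expert_to_device = np.arange(NUM_EXPERTS) % NUM_DEVICES  # static expert placement

def predict_routing(num_tokens: int) -> np.ndarray:
    """Hypothetical lightweight predictor of each token's top-1 expert."""
    return rng.integers(0, NUM_EXPERTS, size=num_tokens)

def true_routing(predicted: np.ndarray, accuracy: float = 0.8) -> np.ndarray:
    """Simulate the gate's actual decision, agreeing with the prediction `accuracy` of the time."""
    actual = predicted.copy()
    miss = rng.random(predicted.size) > accuracy
    actual[miss] = rng.integers(0, NUM_EXPERTS, size=miss.sum())
    return actual

predicted = predict_routing(NUM_TOKENS)
actual = true_routing(predicted)

# Speculative token shuffling: pre-place each token on the device hosting its
# predicted expert before the gate runs.
speculative_device = expert_to_device[predicted]
required_device = expert_to_device[actual]

# Baseline EP: tokens start wherever the batch was sharded (here, round-robin),
# so nearly every token crosses the network in the all-to-all.
baseline_device = np.arange(NUM_TOKENS) % NUM_DEVICES
baseline_volume = np.count_nonzero(baseline_device != required_device)

# With speculation, only tokens whose predicted expert sits on the wrong device move.
speculative_volume = np.count_nonzero(speculative_device != required_device)

print(f"baseline all-to-all tokens moved:    {baseline_volume}")
print(f"speculative all-to-all tokens moved: {speculative_volume}")
print(f"communication volume reduced by {1 - speculative_volume / baseline_volume:.0%}")
```

Speculative expert grouping follows a complementary intuition: experts that are frequently co-activated are co-located on the same device so that a token's whole top-k routing stays local more often; the sketch above only models the token-shuffling side.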