MoESD: Unveil Speculative Decoding's Potential for Accelerating Sparse MoE

📅 2025-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Speculative decoding (SD) has been considered efficient only for dense large language models (LLMs), leaving its applicability to sparse Mixture-of-Experts (MoE) models largely unexamined. Method: This work systematically investigates SD for MoE inference acceleration. It develops a theoretical model of the tradeoffs involved in SD and introduces "target efficiency", a metric that captures how workload and model architecture affect SD acceleration beyond the acceptance rate and that helps identify system bottlenecks. Contribution/Results: At medium batch sizes, MoE models benefit more from SD than dense models, and as MoE models become sparser, the batch-size range over which SD is effective becomes broader. Experiments on different GPUs show up to 2.29× speedup for Qwen2-57B-A14B at medium batch sizes and validate the theoretical predictions. The work offers a new perspective on speeding up MoE inference, particularly for private serving scenarios where existing solutions struggle.

📝 Abstract

Large Language Models (LLMs) have achieved remarkable success across many applications, with Mixture of Experts (MoE) models demonstrating great potential. Compared to traditional dense models, MoEs achieve better performance with less computation. Speculative decoding (SD) is a widely used technique to accelerate LLM inference without accuracy loss, but it has been considered efficient only for dense models. In this work, we first demonstrate that, under medium batch sizes, MoE surprisingly benefits more from SD than dense models. Furthermore, as MoE becomes sparser (the prevailing trend in MoE designs), the batch size range where SD acceleration is expected to be effective becomes broader. To quantitatively understand the tradeoffs involved in SD, we develop a reliable model based on theoretical analysis. While current SD research primarily focuses on improving the acceptance rates of algorithms, changes in workload and model architecture can still lead to degraded SD acceleration even with high acceptance rates. To address this limitation, we introduce a new metric, 'target efficiency', that characterizes these effects, thus helping researchers identify system bottlenecks and understand SD acceleration more comprehensively. For scenarios like private serving, this work unveils a new perspective to speed up MoE inference, where existing solutions struggle. Experiments on different GPUs show up to 2.29x speedup for Qwen2-57B-A14B at medium batch sizes and validate our theoretical predictions.
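
The tradeoff described in the abstract can be illustrated with a back-of-the-envelope calculation. The sketch below is not the paper's model. It uses the standard expected-acceptance formula for speculative decoding under an assumed constant, independent per-token acceptance rate `alpha`, and it assumes that target efficiency can be read as the ratio of the target model's time for one plain decode step to its time for one verification step over the drafted tokens. The names `draft_cost_ratio` and `target_efficiency`, and every numeric value, are hypothetical.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per verification step.

    Assumes each of the `gamma` drafted tokens is accepted independently
    with probability `alpha`; the count includes the one token the target
    model always contributes (a correction or a bonus token).
    """
    if not 0.0 <= alpha <= 1.0 or gamma < 0:
        raise ValueError("alpha must be in [0, 1] and gamma must be >= 0")
    if alpha == 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def sd_speedup(alpha: float, gamma: int,
               draft_cost_ratio: float, target_efficiency: float) -> float:
    """Estimated speedup of SD over plain autoregressive decoding.

    draft_cost_ratio  (assumption): time of one draft forward pass divided
                      by the time of one plain target decode step.
    target_efficiency (assumption): time of one plain target decode step
                      divided by the time of one verification step, in (0, 1].
    Times are normalized so that one plain target decode step costs 1.
    """
    if target_efficiency <= 0.0:
        raise ValueError("target_efficiency must be positive")
    step_time = gamma * draft_cost_ratio + 1.0 / target_efficiency
    return expected_tokens_per_step(alpha, gamma) / step_time


if __name__ == "__main__":
    # Same acceptance rate, different target efficiency (hypothetical values).
    for eff in (1.0, 0.8, 0.5):
        print(eff, round(sd_speedup(alpha=0.8, gamma=4,
                                    draft_cost_ratio=0.05,
                                    target_efficiency=eff), 3))
```

With the acceptance rate held fixed, the estimated speedup falls as target efficiency drops, which mirrors the abstract's point that a high acceptance rate alone does not guarantee acceleration.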

Problem

Research questions and friction points this paper is trying to address.

Accelerating sparse MoE models using speculative decoding

Understanding tradeoffs in speculative decoding for MoE architectures

Introducing the 'target efficiency' metric to identify bottlenecks in SD acceleration

Innovation

Methods, ideas, or system contributions that make the work stand out.

Speculative decoding accelerates sparse MoE models

New metric 'target efficiency' identifies bottlenecks

MoE benefits more from SD than dense models
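
One intuition for why batch size and sparsity matter is the number of experts a single forward pass has to touch. The following is a rough sketch rather than the paper's analysis: it assumes each token routes to `top_k` distinct experts uniformly at random and independently of other tokens, which real routers do not guarantee. The 64-expert, top-8 configuration and the draft length are illustrative values only.

```python
def expected_active_experts(num_experts: int, top_k: int, num_tokens: int) -> float:
    """Expected number of distinct experts activated in one forward pass.

    Assumption: uniform, independent routing. A given expert is skipped by
    one token with probability (1 - top_k / num_experts), so it is skipped
    by all `num_tokens` tokens with that probability raised to num_tokens.
    """
    if not 0 < top_k <= num_experts or num_tokens < 0:
        raise ValueError("need 0 < top_k <= num_experts and num_tokens >= 0")
    p_unused = (1.0 - top_k / num_experts) ** num_tokens
    return num_experts * (1.0 - p_unused)


if __name__ == "__main__":
    num_experts, top_k, gamma = 64, 8, 4  # hypothetical configuration
    for batch in (1, 4, 16, 32):
        decode = expected_active_experts(num_experts, top_k, batch)
        # Verification processes (gamma + 1) token positions per sequence.
        verify = expected_active_experts(num_experts, top_k, batch * (gamma + 1))
        print(batch, round(decode, 1), round(verify, 1), round(verify / decode, 2))
```

Under these assumptions, verification at a batch size of 1 activates several times more experts than a plain decode step, while at a moderate batch size nearly all experts are already active, so verifying the extra drafted tokens adds little additional expert loading.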

🔎 Similar Papers

Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling