MoE-Spec: Expert Budgeting for Efficient Speculative Decoding

📅 2026-02-17
🤖 AI Summary
This work addresses the substantial memory overhead in Mixture-of-Experts (MoE) models during speculative decoding, which arises from activating a large number of experts and limits acceleration gains. The authors propose a training-free, validation-phase expert budget control method that dynamically selects a fixed number of experts per layer based on their contribution to verification accuracy, thereby decoupling speculation depth from memory consumption for the first time. By strategically pruning experts at inference time, the approach significantly reduces latency while preserving model accuracy. Experimental results across various model scales and datasets demonstrate throughput improvements of 10%–30% over state-of-the-art baselines such as EAGLE-3, with the added flexibility to trade off efficiency and accuracy through adjustable expert budgets.

📝 Abstract
Speculative decoding accelerates Large Language Model (LLM) inference by verifying multiple drafted tokens in parallel. However, for Mixture-of-Experts (MoE) models, this parallelism introduces a severe bottleneck: large draft trees activate many unique experts, significantly increasing memory pressure and diminishing speedups from speculative decoding relative to autoregressive decoding. Prior methods reduce speculation depth when MoE verification becomes expensive. We propose MoE-Spec, a training-free verification-time expert budgeting method that decouples speculation depth from memory cost by enforcing a fixed expert capacity limit at each layer, loading only the experts that contribute most to verification and dropping the long tail of rarely used experts that drive bandwidth overhead. Experiments across multiple model scales and datasets show that this method yields 10--30\% higher throughput than state-of-the-art speculative decoding baselines (EAGLE-3) at comparable quality, with flexibility to trade accuracy for further latency reductions through tighter budgets.
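The core idea in the abstract, capping the unique experts loaded per MoE layer during verification of a draft tree, can be sketched as a simple top-B selection. The paper's exact contribution metric is not specified here, so this sketch uses activation frequency across the drafted tokens as a stand-in, and `budget_experts` is a hypothetical helper, not the authors' implementation:

```python
from collections import Counter

def budget_experts(routed_experts, budget):
    """Given per-token routed expert IDs for one MoE layer, keep only the
    `budget` most frequently requested experts (a frequency proxy for their
    contribution to verification) and drop the long tail of rare experts."""
    counts = Counter(e for token_experts in routed_experts for e in token_experts)
    kept = {e for e, _ in counts.most_common(budget)}
    # Tokens whose experts were all pruned fall back to the single most
    # requested expert (a simplification for illustration only).
    fallback = counts.most_common(1)[0][0] if counts else None
    return [
        [e for e in token_experts if e in kept] or [fallback]
        for token_experts in routed_experts
    ]

# Draft tree of 5 tokens, top-2 routing over 8 experts; a budget of 3
# caps the unique experts loaded for this layer at 3 instead of 6.
routed = [[0, 3], [0, 5], [3, 7], [0, 3], [2, 5]]
print(budget_experts(routed, budget=3))
```

The memory saving comes from the cap on *unique* experts per layer, which is what drives bandwidth during parallel verification; tightening `budget` trades verification fidelity for further latency reduction, mirroring the accuracy/efficiency knob the abstract describes.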
Problem

Research questions and friction points this paper is trying to address.

Speculative Decoding
Mixture-of-Experts
Memory Bottleneck
Expert Activation
LLM Inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Speculative Decoding
Mixture-of-Experts
Expert Budgeting
Training-Free Optimization
Efficient Inference
Bradley McDanel
Assistant Professor of Computer Science, Franklin & Marshall College
Machine Learning · Efficient Deep Learning · Dynamic Neural Networks · Computer Architecture
Steven Li
Meta Reality Labs
Sruthikesh Surineni
Meta Reality Labs
Harshit Khaitan
Meta Reality Labs