MoE-Spec: Expert Budgeting for Efficient Speculative Decoding

📅 2026-02-17
🤖 AI Summary
This work addresses the substantial memory overhead in Mixture-of-Experts (MoE) models during speculative decoding, which arises from activating a large number of experts and limits acceleration gains. The authors propose a training-free, validation-phase expert budget control method that dynamically selects a fixed number of experts per layer based on their contribution to verification accuracy, thereby decoupling speculation depth from memory consumption for the first time. By strategically pruning experts at inference time, the approach significantly reduces latency while preserving model accuracy. Experimental results across various model scales and datasets demonstrate throughput improvements of 10%–30% over state-of-the-art baselines such as EAGLE-3, with the added flexibility to trade off efficiency and accuracy through adjustable expert budgets.

📝 Abstract
Speculative decoding accelerates Large Language Model (LLM) inference by verifying multiple drafted tokens in parallel. However, for Mixture-of-Experts (MoE) models, this parallelism introduces a severe bottleneck: large draft trees activate many unique experts, significantly increasing memory pressure and diminishing speedups from speculative decoding relative to autoregressive decoding. Prior methods reduce speculation depth when MoE verification becomes expensive. We propose MoE-Spec, a training-free verification-time expert budgeting method that decouples speculation depth from memory cost by enforcing a fixed expert capacity limit at each layer, loading only the experts that contribute most to verification and dropping the long tail of rarely used experts that drive bandwidth overhead. Experiments across multiple model scales and datasets show that this method yields 10--30\% higher throughput than state-of-the-art speculative decoding baselines (EAGLE-3) at comparable quality, with flexibility to trade accuracy for further latency reductions through tighter budgets.
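The core idea in the abstract, capping the unique experts loaded per MoE layer during verification of a draft tree, can be sketched as a simple top-B selection. The paper's exact contribution metric is not specified here, so this sketch uses activation frequency across the drafted tokens as a stand-in, and `budget_experts` is a hypothetical helper, not the authors' implementation:

```python
from collections import Counter

def budget_experts(routed_experts, budget):
    """Given per-token routed expert IDs for one MoE layer, keep only the
    `budget` most frequently requested experts (a frequency proxy for their
    contribution to verification) and drop the long tail of rare experts."""
    counts = Counter(e for token_experts in routed_experts for e in token_experts)
    kept = {e for e, _ in counts.most_common(budget)}
    # Tokens whose experts were all pruned fall back to the single most
    # requested expert (a simplification for illustration only).
    fallback = counts.most_common(1)[0][0] if counts else None
    return [
        [e for e in token_experts if e in kept] or [fallback]
        for token_experts in routed_experts
    ]

# Draft tree of 5 tokens, top-2 routing over 8 experts; a budget of 3
# caps the unique experts loaded for this layer at 3 instead of 6.
routed = [[0, 3], [0, 5], [3, 7], [0, 3], [2, 5]]
print(budget_experts(routed, budget=3))
```

The memory saving comes from the cap on *unique* experts per layer, which is what drives bandwidth during parallel verification; tightening `budget` trades verification fidelity for further latency reduction, mirroring the accuracy/efficiency knob the abstract describes.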
Problem

Research questions and friction points this paper is trying to address.

Speculative Decoding
Mixture-of-Experts
Memory Bottleneck
Expert Activation
LLM Inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Speculative Decoding
Mixture-of-Experts
Expert Budgeting
Training-Free Optimization
Efficient Inference
Bradley McDanel
Assistant Professor of Computer Science, Franklin & Marshall College
Machine Learning · Efficient Deep Learning · Dynamic Neural Networks · Computer Architecture
Steven Li
Meta Reality Labs
Sruthikesh Surineni
Meta Reality Labs
Harshit Khaitan
Meta Reality Labs