Making Every Verified Token Count: Adaptive Verification for MoE Speculative Decoding

📅 2026-04-30
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the high verification overhead of tree-based speculative decoding for sparse Mixture-of-Experts (MoE) models, where different tree branches activate different experts and the growing union of activated experts erodes the speedup. The authors propose EVICT, a training-free, hyperparameter-free, and lossless adaptive verification method. EVICT truncates the draft tree before target verification: it combines fine-grained drafter signals, used to estimate each candidate's benefit, with offline-profiled verification cost, and retains only the cost-effective prefix. EVICT is compatible with the graph-based serving framework SGLang. Across diverse MoE models and benchmarks, it achieves up to 2.35× speedup over autoregressive decoding and an average 1.21× speedup over EAGLE-3, while significantly reducing unnecessary expert activations during verification.
📝 Abstract
Tree-based speculative decoding accelerates autoregressive generation by verifying multiple draft candidates in parallel, but this advantage weakens for sparse Mixture-of-Experts (MoE) models. As the draft tree grows, different branches activate different experts, expanding the union of activated experts and substantially increasing target-side verification cost. We propose EVICT, a training-free, hyperparameter-free, and lossless adaptive verification method for MoE speculative decoding. EVICT makes every verified token count by truncating the draft tree before target verification and retaining only the cost-effective prefix. It leverages fine-grained drafter signals to estimate candidate benefit, combines them with offline-profiled verification cost, and remains highly compatible with the high-performance graph-based serving framework SGLang. Extensive experiments on diverse MoE backbones and benchmarks show that EVICT achieves up to 2.35x speedup over autoregressive decoding and an average 1.21x speedup over the state-of-the-art baseline EAGLE-3, while significantly reducing unnecessary expert activations during verification.
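The abstract describes the mechanism only at a high level, so the following is a minimal illustrative sketch of cost-aware prefix truncation, not the authors' algorithm. It assumes that candidates arrive sorted by drafter path probability, that the sum of those probabilities approximates the expected number of accepted draft tokens, and that a table of verification latencies per token count has been profiled offline. The function name `choose_prefix_length` and the numbers in the usage note are hypothetical.

```python
def choose_prefix_length(path_probs, verify_cost):
    """Pick how many draft candidates to send to the target model.

    path_probs:  drafter path probabilities of the candidates, sorted in
                 descending order (assumed proxy for acceptance chance).
    verify_cost: verify_cost[k] is the offline-profiled latency of one
                 target step that verifies k draft tokens, for
                 k = 0 .. len(path_probs). verify_cost[0] is a plain
                 decoding step with no draft tokens.

    Returns the prefix length k that maximizes expected tokens produced
    per unit of verification latency.
    """
    if len(verify_cost) != len(path_probs) + 1:
        raise ValueError("verify_cost needs one entry per prefix length 0..n")

    # With no draft tokens, the target step still yields exactly one token.
    expected_tokens = 1.0
    best_k = 0
    best_rate = expected_tokens / verify_cost[0]

    for k, prob in enumerate(path_probs, start=1):
        # Assumption: each kept candidate adds its path probability to the
        # expected number of accepted tokens.
        expected_tokens += prob
        rate = expected_tokens / verify_cost[k]
        if rate > best_rate:
            best_k, best_rate = k, rate

    return best_k
```

With hypothetical inputs `path_probs = [0.9, 0.6, 0.3, 0.05]` and `verify_cost = [1.0, 1.1, 1.3, 1.8, 2.4]`, the sketch keeps two candidates: the third and fourth add little expected benefit while the cost, which in an MoE model grows as more experts are activated, rises quickly. With a flat cost table the same rule keeps every candidate, which matches the intuition that truncation matters most when verification cost grows with tree size. Because the target model still verifies whatever is kept, trimming in this way changes only how much work is verified, not which tokens can be accepted.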
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
speculative decoding
expert activation
verification cost
autoregressive generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

speculative decoding
Mixture-of-Experts
adaptive verification
expert activation
tree-based decoding
🔎 Similar Papers