Faster LLM Inference via Sequential Monte Carlo

📅 2026-04-16

🤖 AI Summary

This work addresses the throughput loss in conventional speculative decoding that occurs when the draft and target models diverge: a rejected token truncates the draft block, which limits inference acceleration for large language models. The authors propose sequential Monte Carlo speculative decoding (SMC-SD), which replaces hard per-token rejection with importance-weighted resampling over a population of draft particles. Because particles are drafted and scored in parallel, verification becomes a vectorized, fixed-size operation with no rollback. The method is approximate rather than exact: it stays within 3% of the target model's accuracy on reasoning, instruction-following, and coding benchmarks while achieving a 2.36× speedup over standard speculative decoding and a 5.2× speedup over autoregressive decoding.
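For context, the baseline behaviour that the summary refers to can be sketched as follows. This is a minimal toy version of the widely used speculative-decoding acceptance rule (accept a drafted token with probability min(1, p/q), otherwise resample from the residual), not code from the paper. The function name `verify_draft_block` and the dict-based probability tables are illustrative assumptions, and the bonus token that is usually sampled after a fully accepted block is omitted.

```python
import random


def verify_draft_block(draft_tokens, q_probs, p_probs, rng):
    """Toy verification of one draft block with per-token rejection sampling.

    draft_tokens: token ids proposed by the draft model.
    q_probs[t], p_probs[t]: dicts mapping token id -> probability under the
        draft (q) and target (p) model at position t (hypothetical stand-ins
        for real model outputs).
    Returns the accepted prefix, plus one corrected token if a rejection occurs.
    """
    accepted = []
    for t, tok in enumerate(draft_tokens):
        q, p = q_probs[t], p_probs[t]
        # Accept the drafted token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p.get(tok, 0.0) / q[tok]):
            accepted.append(tok)
            continue
        # First rejection: resample from the residual max(0, p - q) and
        # discard the rest of the draft block. This truncation is what
        # hurts throughput when draft and target diverge.
        residual = {v: max(0.0, p[v] - q.get(v, 0.0)) for v in p}
        tokens, weights = zip(*residual.items())
        accepted.append(rng.choices(tokens, weights=weights, k=1)[0])
        break
    return accepted


if __name__ == "__main__":
    rng = random.Random(0)
    q = {0: 0.5, 1: 0.5}
    p = {0: 0.1, 1: 0.9}
    print(verify_draft_block([0, 0, 0, 0], [q] * 4, [p] * 4, rng))
```

When the two distributions agree, every drafted token is kept; when they disagree, the block is cut at the first rejection, so the work spent drafting the later tokens is wasted.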

📝 Abstract

Speculative decoding (SD) accelerates language model inference by drafting tokens from a cheap proposal model and verifying them against an expensive target model via rejection sampling. Because rejection truncates the draft block at the first error, throughput degrades when draft and target diverge. Rather than rejecting draft tokens outright, we propose to reweight them. To this end, we introduce sequential Monte Carlo speculative decoding (SMC-SD), which replaces token-level rejection with importance-weighted resampling over a population of draft particles. SMC-SD is a principled approximate inference scheme that trades exactness for additional speed, while preserving theoretical bounds on its per-step approximation error. Because LLM inference is memory bandwidth-bound, the arithmetic needed to draft particles and to score them in parallel comes nearly for free: SMC-SD uses idle compute to turn verification into a vectorized, fixed-size operation with no rollback. Empirically, SMC-SD achieves a 2.36x speed-up over speculative decoding and a 5.2x speed-up over autoregressive decoding, while remaining within 3% of the target model's accuracy on reasoning, instruction-following, and coding benchmarks.
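The reweight-then-resample idea can be illustrated with a generic sequential Monte Carlo step. The sketch below is an assumption-laden toy, not the authors' implementation: `smc_step`, `draft_step`, and `target_logprob` are hypothetical names, the models are replaced by simple callables, and plain multinomial resampling after every step is assumed because the abstract does not specify the resampling schedule.

```python
import math
import random


def smc_step(particles, log_weights, draft_step, target_logprob, rng):
    """Extend every particle by one drafted token, reweight, then resample.

    particles: list of token lists (partial continuations).
    log_weights: unnormalized log importance weight per particle.
    draft_step(prefix, rng) -> (token, draft_logprob)   # hypothetical draft sampler
    target_logprob(prefix, token) -> float               # hypothetical target scorer
    Assumes at least one particle has nonzero target probability.
    """
    n = len(particles)
    extended, new_log_weights = [], []
    for prefix, lw in zip(particles, log_weights):
        tok, logq = draft_step(prefix, rng)
        logp = target_logprob(prefix, tok)
        # Reweight instead of reject: w *= p(tok | prefix) / q(tok | prefix).
        new_log_weights.append(lw + logp - logq)
        extended.append(prefix + [tok])

    # Normalize in a numerically stable way (subtract the max log weight).
    m = max(new_log_weights)
    w = [math.exp(lw - m) for lw in new_log_weights]
    total = sum(w)
    w = [x / total for x in w]

    # Multinomial resampling (an assumption): the population size stays fixed,
    # and no particle is rolled back or truncated.
    idx = rng.choices(range(n), weights=w, k=n)
    resampled = [list(extended[j]) for j in idx]
    return resampled, [0.0] * n, w


if __name__ == "__main__":
    rng = random.Random(0)

    def draft_step(prefix, rng):
        # Toy draft model: uniform over a two-token vocabulary.
        return rng.randrange(2), math.log(0.5)

    def target_logprob(prefix, tok):
        # Toy target model: prefers token 1.
        return math.log(0.9 if tok == 1 else 0.1)

    particles, log_w = [[] for _ in range(8)], [0.0] * 8
    for _ in range(4):
        particles, log_w, _ = smc_step(particles, log_w, draft_step, target_logprob, rng)
    print(particles)
```

Every step processes the same number of particles regardless of how well the draft matches the target, which is one way to read the abstract's description of verification as a fixed-size operation. Particles the target scores poorly are simply less likely to survive resampling.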

Problem

Research questions and friction points this paper is trying to address.

speculative decoding

LLM inference

rejection sampling

throughput degradation

draft-target divergence

Innovation

Methods, ideas, or system contributions that make the work stand out.

Sequential Monte Carlo

Speculative Decoding

Importance Weighting