🤖 AI Summary
Speculative inference (SI) can incur higher latency than standard autoregressive (non-SI) decoding when draft models are too slow or inaccurate, undermining its core motivation.
Method: We propose Distributed Speculative Inference (DSI), a novel algorithm that introduces "speculation parallelism" (SP), a paradigm enabling lossless generation provably faster than both SI and non-SI, without modifying the frozen target model. DSI integrates distributed task scheduling, cooperative inference between frozen models, and theory-driven latency modeling.
Contribution/Results: We prove that DSI reduces latency relative to both SI and non-SI for *any* draft model, and we characterize the fundamental trade-off between computational resources and end-to-end latency. Simulations on single-node setups show DSI accelerates inference by 1.29×–1.92× over state-of-the-art SI methods. All code is publicly released.
📝 Abstract
This paper introduces distributed speculative inference (DSI), a novel inference algorithm that is provably faster than speculative inference (SI) [leviathan2023, chen2023, miao2024, sun2025, timor2025] and standard autoregressive inference (non-SI). Like other SI algorithms, DSI operates on frozen language models (LMs), requiring no training or architectural modifications, and it preserves the target distribution. Prior studies on SI have demonstrated empirical speedups over non-SI, but they rely on sufficiently fast and accurate drafters, which are often unavailable in practice. We identify a gap: SI can be slower than non-SI if drafters are too slow or inaccurate. We close this gap by proving that DSI is faster than both SI and non-SI, given any drafters. DSI is therefore not only faster than SI; it also unlocks the acceleration of LMs for which SI fails. DSI leverages speculation parallelism (SP), a novel type of task parallelism, to orchestrate target and drafter instances that overlap in time, establishing a new foundational tradeoff between computational resources and latency. Our simulations show that DSI is 1.29x-1.92x faster than SI in single-node setups for various off-the-shelf LMs and tasks. We open-source all our code.
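To make the idea of speculation parallelism concrete, the following is a minimal toy sketch, not the paper's actual algorithm. It uses hypothetical stand-in models (`target_next`, `drafter_next` over fixed token lists) and shows the core pattern the abstract describes: one target verification task per draft prefix, launched so that verifications overlap in time rather than running as one sequential verify-then-extend loop, while the final output always matches the target model exactly (lossless).

```python
# Toy illustration of speculation parallelism (SP). All names and token
# sequences here are hypothetical, invented for this sketch.
from concurrent.futures import ThreadPoolExecutor

TARGET_OUT = [1, 2, 3, 4, 5]   # ground-truth continuation of the target LM
DRAFT_OUT  = [1, 2, 9, 4, 5]   # drafter's guesses (wrong at index 2)

def target_next(prefix):
    # Frozen target model: the true next token after `prefix`.
    return TARGET_OUT[len(prefix)]

def drafter_next(prefix):
    # Cheap drafter: its (possibly wrong) guess for the next token.
    return DRAFT_OUT[len(prefix)]

def dsi_generate(n_tokens, block=3):
    out = []
    with ThreadPoolExecutor() as pool:
        while len(out) < n_tokens:
            # Drafter speculates a block of tokens ahead of the target.
            draft = []
            while len(out) + len(draft) < n_tokens and len(draft) < block:
                draft.append(drafter_next(out + draft))
            # SP: one target verification task per draft prefix, all
            # submitted at once so they overlap in time.
            futures = [pool.submit(target_next, out + draft[:i])
                       for i in range(len(draft))]
            for i, fut in enumerate(futures):
                true_tok = fut.result()
                out.append(true_tok)       # always emit the target's token
                if true_tok != draft[i]:   # first mismatch invalidates the
                    break                  # rest of the draft block
                if len(out) == n_tokens:
                    break
    return out

print(dsi_generate(5))  # → [1, 2, 3, 4, 5], identical to the target's output
```

A real implementation would additionally cancel verification tasks made obsolete by an earlier mismatch and distribute target/drafter instances across devices; this sketch only shows why overlapping verifications can hide drafter latency without changing the generated distribution.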