🤖 AI Summary
Speculative decoding faces two key bottlenecks: autoregressive dependency in the draft generation phase limits parallelism, and semantic misalignment between draft and target models leads to high rejection rates. This paper proposes SpecDiff-2, the first framework to integrate discrete diffusion models into the drafting stage of speculative decoding, enabling non-autoregressive, high-quality candidate sequence generation. We further introduce a lightweight model calibration mechanism that explicitly optimizes semantic alignment between drafted tokens and the autoregressive verifier. Evaluated across reasoning, programming, and mathematical benchmarks, SpecDiff-2 achieves state-of-the-art performance, delivering an average throughput improvement of 55% and up to 5.5× speedup over standard autoregressive decoding—without any accuracy degradation. Our core contributions are the deep integration of discrete diffusion modeling into speculative decoding and the establishment of a learnable, cross-model semantic alignment paradigm.
📝 Abstract
Speculative decoding has become the standard approach for accelerating Large Language Model (LLM) inference. It exploits a lossless draft-then-verify procedure to circumvent the latency of autoregressive decoding, achieving impressive speed-ups. Yet, current speculative decoding approaches remain limited by two fundamental bottlenecks: (1) the autoregressive dependency during drafting which limits parallelism, and (2) frequent rejections of draft tokens caused by misalignment between the draft and verify models. This paper proposes SpecDiff-2, a novel framework to jointly address these two bottlenecks. It leverages discrete diffusion as a non-autoregressive drafter to address bottleneck (1) and develops novel techniques to calibrate discrete diffusion drafters with autoregressive verifiers, addressing bottleneck (2). Experimental results across a comprehensive benchmark suite show that SpecDiff-2 achieves a new state-of-the-art across reasoning, coding, and mathematical benchmarks, improving tokens-per-second by up to an average of +55% over previous baselines and obtaining up to 5.5x average speed-up over standard decoding, without any loss of accuracy.