SpecDiff-2: Scaling Diffusion Drafter Alignment For Faster Speculative Decoding

📅 2025-11-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Speculative decoding faces two key bottlenecks: autoregressive dependency in the draft generation phase limits parallelism, and semantic misalignment between draft and target models leads to high rejection rates. This paper proposes SpecDiff-2, the first framework to integrate discrete diffusion models into the drafting stage of speculative decoding, enabling non-autoregressive, high-quality candidate sequence generation. We further introduce a lightweight model calibration mechanism that explicitly optimizes semantic alignment between drafted tokens and the autoregressive verifier. Evaluated across reasoning, programming, and mathematical benchmarks, SpecDiff-2 achieves state-of-the-art performance, delivering an average throughput improvement of 55% and up to 5.5× speedup over standard autoregressive decoding—without any accuracy degradation. Our core contributions are the deep integration of discrete diffusion modeling into speculative decoding and the establishment of a learnable, cross-model semantic alignment paradigm.

📝 Abstract
Speculative decoding has become the standard approach for accelerating Large Language Model (LLM) inference. It exploits a lossless draft-then-verify procedure to circumvent the latency of autoregressive decoding, achieving impressive speed-ups. Yet, current speculative decoding approaches remain limited by two fundamental bottlenecks: (1) the autoregressive dependency during drafting, which limits parallelism, and (2) frequent rejections of draft tokens caused by misalignment between the draft and verifier models. This paper proposes SpecDiff-2, a novel framework to jointly address these two bottlenecks. It leverages discrete diffusion as a non-autoregressive drafter to address bottleneck (1) and develops novel techniques to calibrate discrete diffusion drafters with autoregressive verifiers, addressing bottleneck (2). Experimental results across a comprehensive benchmark suite show that SpecDiff-2 achieves a new state-of-the-art across reasoning, coding, and mathematical benchmarks, improving tokens-per-second by an average of +55% over previous baselines and obtaining up to 5.5x average speed-up over standard decoding, without any loss of accuracy.
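The lossless draft-then-verify procedure the abstract refers to is the standard speculative-sampling acceptance rule: a drafted token is accepted with probability min(1, p/q), and on rejection a replacement is drawn from the normalized residual max(0, p − q), which provably preserves the target distribution. A minimal sketch of that rule (toy three-token distributions; not the paper's SpecDiff-2 implementation):

```python
import random

def speculative_accept(p_target, q_draft, token):
    """Standard lossless accept rule: keep a drafted token with prob min(1, p/q)."""
    p, q = p_target[token], q_draft[token]
    return random.random() < min(1.0, p / q)

def residual_sample(p_target, q_draft):
    """On rejection, resample from the normalized residual max(0, p - q),
    which makes the overall procedure exactly match the target distribution."""
    residual = {t: max(0.0, p_target[t] - q_draft.get(t, 0.0)) for t in p_target}
    z = sum(residual.values())
    r, acc = random.random() * z, 0.0
    for t, w in residual.items():
        acc += w
        if r < acc:
            return t
    return max(residual, key=residual.get)  # numerical-edge fallback

# Toy 3-token vocabulary: a well-aligned drafter yields a high acceptance rate,
# which is exactly what the paper's calibration mechanism targets.
p = {"a": 0.6, "b": 0.3, "c": 0.1}   # target (verifier) distribution
q = {"a": 0.5, "b": 0.4, "c": 0.1}   # drafter distribution
random.seed(0)
accepts = sum(speculative_accept(p, q, "a") for _ in range(10_000))
```

Since p("a")/q("a") = 1.2 ≥ 1, token "a" is always accepted here; token "b" would be accepted only 3/4 of the time, illustrating how drafter-verifier misalignment translates directly into rejections.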
Problem

Research questions and friction points this paper is trying to address.

Addressing autoregressive dependency limiting parallelism in drafting
Reducing frequent draft token rejections from model misalignment
Accelerating LLM inference while maintaining lossless decoding accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses discrete diffusion as non-autoregressive drafter
Calibrates diffusion drafters with autoregressive verifiers
Achieves speed-up without accuracy loss
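The parallelism advantage behind the first innovation can be illustrated with a toy masked-denoising loop: a diffusion-style drafter predicts every masked position in one batched call per refinement step, whereas an autoregressive drafter needs one call per token. The `fill` function below is a hypothetical stand-in for a denoiser's per-position prediction, not the paper's actual model:

```python
import random

def parallel_denoise_step(tokens, fill_fn):
    """One diffusion-style refinement step: fill every masked (None) position
    at once, i.e. a single batched model call for the whole draft."""
    return [fill_fn(i) if t is None else t for i, t in enumerate(tokens)]

# Hypothetical stand-in for the denoiser's per-position prediction.
vocab = ["the", "cat", "sat"]
random.seed(1)
fill = lambda i: random.choice(vocab)

draft = [None] * 6   # fully masked draft of 6 candidate tokens
steps = 0
while any(t is None for t in draft):
    draft = parallel_denoise_step(draft, fill)
    steps += 1
```

Here all 6 positions are drafted in a single step, versus 6 sequential forward passes for an autoregressive drafter of the same length; real discrete diffusion drafters trade a small number of refinement steps for this per-token parallelism.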
Jameson Sandler
PhD Student, University of Virginia
Accelerated LLMs · AI Reasoning · RL
Jacob K. Christopher
Department of Computer Science, University of Virginia, Charlottesville, USA
Thomas Hartvigsen
Department of Computer Science, University of Virginia, Charlottesville, USA
Nando Fioretto
Department of Computer Science, University of Virginia, Charlottesville, USA