🤖 AI Summary
Autoregressive (AR) decoding in large language models (LLMs) incurs high inference latency, and existing speculative decoding methods still rely on AR drafters, leaving the fundamental sequential bottleneck in place. Method: We propose DiffuSpec, a training-free, plug-and-play speculative decoding framework that pioneers the use of pre-trained diffusion language models (DLMs) for non-autoregressive, multi-token draft generation. To reconcile the bidirectional conditioning of DLMs with left-to-right AR verification, we introduce a token-lattice representation, a causal-consistency path search (CPS), and adaptive draft-length control (ADL). DiffuSpec integrates seamlessly with standard AR verifiers without modifying the base LLM. Contribution/Results: Evaluated across multiple benchmarks, DiffuSpec achieves up to a 3× end-to-end speedup, significantly improving throughput while preserving accuracy and demonstrating diffusion models' viability as a new paradigm for efficient LLM inference.
📝 Abstract
As large language models (LLMs) scale up, accuracy improves, but the autoregressive (AR) nature of decoding increases latency, since each token requires a serial forward pass. Speculative decoding addresses this by employing a fast drafter to propose multi-token drafts, which the target model then verifies in parallel. However, many deployments still rely on AR drafters, whose sequential passes limit wall-clock gains. We revisit the drafting stage and present DiffuSpec, a training-free, drop-in framework that uses a pretrained diffusion language model (DLM) to produce multi-token drafts in a single forward pass while remaining compatible with standard AR verifiers. Because DLM drafts are generated under bidirectional conditioning, the parallel per-position candidates form a token lattice in which the locally highest-probability tokens need not compose a causally consistent left-to-right path. Moreover, DLM drafting requires pre-specifying a draft length, inducing a speed-quality trade-off. To address these challenges, we introduce two practical components: (i) a causal-consistency path search (CPS) over this lattice that extracts a left-to-right path aligned with AR verification; and (ii) an adaptive draft-length (ADL) controller that adjusts the next proposal size based on recent acceptance feedback and the realized generation length. Across benchmarks, DiffuSpec yields up to a 3× wall-clock speedup, establishing diffusion-based drafting as a robust alternative to autoregressive drafters in speculative decoding.
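To make the two components concrete, here is a minimal illustrative sketch based only on the description above. The lattice representation (a list of per-position candidate tokens with drafter scores), the `bigram_logp` causal-consistency score, the beam width, and the ADL thresholds are all assumptions for illustration, not the paper's actual algorithm.

```python
def cps(lattice, bigram_logp, beam=4):
    """Causal-consistency path search (sketch).

    `lattice` is a list over draft positions; each entry is a list of
    (token, marginal_logp) candidates from the diffusion drafter.
    `bigram_logp(prev, tok)` is a hypothetical left-to-right consistency
    score (the paper's scoring may differ). A small beam search returns
    the highest-scoring left-to-right token path through the lattice.
    """
    beams = [([], 0.0)]  # (path, cumulative score)
    for candidates in lattice:
        expanded = []
        for path, score in beams:
            prev = path[-1] if path else None
            for tok, lp in candidates:
                # Add the causal transition score only once a left
                # context exists.
                causal = bigram_logp(prev, tok) if prev is not None else 0.0
                expanded.append((path + [tok], score + lp + causal))
        expanded.sort(key=lambda x: x[1], reverse=True)
        beams = expanded[:beam]  # keep the best partial paths
    return beams[0][0]


def adl_update(draft_len, accepted, lo=2, hi=16):
    """Adaptive draft-length control (sketch): grow the next proposal
    when most of the last draft was accepted, shrink it otherwise.
    The 0.8/0.4 thresholds and doubling/halving steps are assumptions."""
    rate = accepted / max(draft_len, 1)
    if rate > 0.8:
        draft_len = min(draft_len * 2, hi)
    elif rate < 0.4:
        draft_len = max(draft_len // 2, lo)
    return draft_len
```

For example, with a two-position lattice and a zero transition score, CPS simply picks the path with the highest summed marginal log-probabilities; a real consistency score would instead penalize token pairs that are implausible left-to-right, which is what aligns the draft with AR verification.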