🤖 AI Summary
This work proposes DART, a novel speculative decoding framework that addresses the high latency bottleneck in existing model-based approaches such as EAGLE3, which rely on multi-step autoregressive draft generation. DART introduces, for the first time, the concept of diffusion language models into speculative decoding by enabling non-autoregressive draft generation through a single forward pass that predicts logits for multiple future positions in parallel. The method constructs high-quality draft token trees by integrating multi-position masked modeling conditioned on the target model's hidden states with an N-gram-constrained semantic continuity tree-pruning algorithm. Experimental results demonstrate that DART achieves end-to-end speedups of 2.03×–3.44× across multiple benchmarks, outperforming EAGLE3 by an average of 30% and significantly enhancing large language model inference efficiency.
📝 Abstract
Speculative decoding is an effective and lossless approach for accelerating LLM inference. However, existing widely adopted model-based draft designs, such as EAGLE3, improve accuracy at the cost of multi-step autoregressive inference, resulting in high drafting latency and ultimately rendering the drafting stage itself a performance bottleneck. Inspired by diffusion-based large language models (dLLMs), we propose DART, which leverages parallel generation to reduce drafting latency. DART predicts logits for multiple future masked positions in parallel within a single forward pass based on hidden states of the target model, thereby eliminating autoregressive rollouts in the draft model while preserving a lightweight design. Based on these parallel logit predictions, we further introduce an efficient tree pruning algorithm that constructs high-quality draft token trees with N-gram-enforced semantic continuity. DART substantially reduces draft-stage overhead while preserving high draft accuracy, leading to significantly improved end-to-end decoding speed. Experimental results demonstrate that DART achieves a 2.03×–3.44× wall-clock time speedup across multiple datasets, surpassing EAGLE3 by 30% on average and offering a practical speculative decoding framework. Code is released at https://github.com/fvliang/DART.
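To make the two core ideas concrete, here is a minimal toy sketch in NumPy: a draft head that emits logits for K future masked positions from a single matrix contraction (no autoregressive rollout), and a bigram-based continuity filter over candidate draft paths. All names (`draft_parallel`, `ngram_prune`, the per-position weight tensor `W`) and the use of bigrams rather than general N-grams are illustrative assumptions, not DART's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, HIDDEN, K = 50, 16, 3  # toy vocab size, hidden dim, draft depth

# Hypothetical draft head: one weight matrix per future position, so a
# single contraction yields logits for all K masked slots at once.
W = rng.normal(size=(K, HIDDEN, VOCAB))

def draft_parallel(h):
    """Predict logits for K future positions in ONE forward pass,
    conditioned on the target model's last hidden state h (toy stand-in
    for DART's non-autoregressive drafting)."""
    return np.einsum("d,kdv->kv", h, W)  # shape (K, VOCAB)

def ngram_prune(paths, context, n=2):
    """Keep only draft paths whose every n-gram (here: bigram) also
    occurs in the context -- a toy semantic-continuity constraint."""
    ctx = set(zip(context, context[1:]))
    return [p for p in paths if all(bg in ctx for bg in zip(p, p[1:]))]

h = rng.normal(size=HIDDEN)              # stand-in for a hidden state
logits = draft_parallel(h)               # (K, VOCAB) in one pass
tree_levels = np.argsort(logits, axis=1)[:, -2:]  # top-2 per position

context = [1, 2, 3, 2, 3, 4]
paths = [[3, 2, 3], [3, 5, 6]]
kept = ngram_prune(paths, context)       # [3,5,6] has bigrams absent
print(logits.shape, tree_levels.shape, kept)
```

Top candidates per position define the levels of a small draft token tree; the continuity filter then discards root-to-leaf paths that break local N-gram coherence before the target model verifies the survivors.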