DART: Diffusion-Inspired Speculative Decoding for Fast LLM Inference

πŸ“… 2026-01-27
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work proposes DART, a novel speculative decoding framework that addresses the high latency bottleneck in existing model-based approaches such as EAGLE3, which rely on multi-step autoregressive draft generation. DART introduces, for the first time, the concept of diffusion language models into speculative decoding by enabling non-autoregressive draft generation through a single forward pass that predicts logits for multiple future positions in parallel. The method constructs high-quality draft token trees by integrating multi-position masked modeling conditioned on the target model’s hidden states with an N-gram-constrained semantic continuity tree-pruning algorithm. Experimental results demonstrate that DART achieves end-to-end speedups of 2.03×–3.44Γ— across multiple benchmarks, outperforming EAGLE3 by an average of 30% and significantly enhancing large language model inference efficiency.
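To make the parallel-drafting idea above concrete, here is a minimal NumPy sketch of predicting logits for several future masked positions in one forward pass, conditioned on a target-model hidden state. The sizes, the shared projection `W`, the per-position mask embeddings, and the function name `draft_logits` are all illustrative stand-ins, not DART's actual architecture (see the released code for that).

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, VOCAB, K = 16, 32, 4  # toy sizes; real models are far larger

# Hypothetical parameters: one shared vocab projection plus K learned
# "mask position" embeddings, standing in for multi-position masked
# modeling conditioned on the target model's hidden states.
W = rng.standard_normal((HIDDEN, VOCAB)) * 0.1
mask_emb = rng.standard_normal((K, HIDDEN)) * 0.1

def draft_logits(target_hidden: np.ndarray) -> np.ndarray:
    """One forward pass -> logits for K future positions in parallel.

    target_hidden: (HIDDEN,) hidden state from the target model's last
    accepted token. No autoregressive rollout is needed: every masked
    future position is conditioned on the same hidden state plus its
    own position embedding, so all K rows are computed at once.
    """
    conditioned = target_hidden[None, :] + mask_emb  # (K, HIDDEN)
    return conditioned @ W                           # (K, VOCAB)

h = rng.standard_normal(HIDDEN)
logits = draft_logits(h)
print(logits.shape)  # one pass yields K rows of vocab logits
```

The contrast with an EAGLE3-style draft model is that the latter would run K sequential forward passes, each feeding the previous draft token back in; here the K positions share one pass.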

πŸ“ Abstract
Speculative decoding is an effective and lossless approach for accelerating LLM inference. However, existing widely adopted model-based draft designs, such as EAGLE3, improve accuracy at the cost of multi-step autoregressive inference, resulting in high drafting latency and ultimately rendering the drafting stage itself a performance bottleneck. Inspired by diffusion-based large language models (dLLMs), we propose DART, which leverages parallel generation to reduce drafting latency. DART predicts logits for multiple future masked positions in parallel within a single forward pass based on the hidden states of the target model, thereby eliminating autoregressive rollouts in the draft model while preserving a lightweight design. Based on these parallel logit predictions, we further introduce an efficient tree-pruning algorithm that constructs high-quality draft token trees with N-gram-enforced semantic continuity. DART substantially reduces draft-stage overhead while preserving high draft accuracy, leading to significantly improved end-to-end decoding speed. Experimental results demonstrate that DART achieves a 2.03×–3.44× wall-clock speedup across multiple datasets, surpassing EAGLE3 by 30% on average and offering a practical speculative decoding framework. Code is released at https://github.com/fvliang/DART.
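The tree-pruning step from the abstract can be caricatured as follows. This is a minimal sketch under stated assumptions, not the paper's algorithm: the bigram whitelist `allowed_bigrams`, the path-probability scoring, and the `top_k`/`keep` parameters are illustrative stand-ins for DART's N-gram semantic-continuity constraint.

```python
import itertools
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def build_pruned_tree(logits, top_k=3, keep=4, allowed_bigrams=None):
    """Toy draft-tree construction with a bigram-continuity filter.

    logits: (K, VOCAB) parallel draft logits, one row per future
    position. Candidate paths take the top_k tokens at each depth;
    any path containing a bigram outside `allowed_bigrams` is pruned
    (a stand-in for an N-gram continuity check), and the `keep`
    highest-probability survivors form the draft token tree.
    """
    probs = softmax(logits)
    per_pos = [np.argsort(p)[::-1][:top_k] for p in probs]  # top_k ids per depth
    paths = []
    for path in itertools.product(*per_pos):
        bigrams = list(zip(path, path[1:]))
        if allowed_bigrams is not None and any(
            b not in allowed_bigrams for b in bigrams
        ):
            continue  # prune: path breaks n-gram continuity
        score = float(np.prod([probs[d, t] for d, t in enumerate(path)]))
        paths.append((score, path))
    paths.sort(reverse=True)
    return paths[:keep]

# Tiny example: two future positions, three-token vocabulary.
toy_logits = np.array([[2.0, 1.0, 0.0],
                       [0.0, 2.0, 1.0]])
survivors = build_pruned_tree(toy_logits, top_k=2, keep=3,
                              allowed_bigrams={(0, 1), (1, 2)})
```

The point of the filter is that parallel per-position logits are predicted independently, so a constraint spanning adjacent positions is what restores local coherence to the tree before verification by the target model.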
Problem

Research questions and friction points this paper is trying to address.

speculative decoding
LLM inference
drafting latency
autoregressive inference
performance bottleneck
Innovation

Methods, ideas, or system contributions that make the work stand out.

speculative decoding
diffusion-inspired
parallel generation
tree pruning
LLM acceleration
πŸ”Ž Similar Papers
No similar papers found.
Fuliang Liu
State Key Laboratory of Novel Software Technology, Nanjing University
Xue Li
Alibaba Group
Ketai Zhao
State Key Laboratory of Novel Software Technology, Nanjing University
Yinxi Gao
State Key Laboratory of Novel Software Technology, Nanjing University
Ziyan Zhou
State Key Laboratory of Novel Software Technology, Nanjing University
Zhonghui Zhang
State Key Laboratory of Novel Software Technology, Nanjing University
Zhibin Wang
Zhejiang University
new particle formation, aerosols, hygroscopicity, black carbon
Wanchun Dou
State Key Laboratory of Novel Software Technology, Nanjing University
Sheng Zhong
Nanjing University
computer networks, security and privacy, theory of computing
Chen Tian
Professor, Nanjing University
Data Center Networking, Network Function Virtualisation, Content Distribution