🤖 AI Summary
This work proposes DART, a novel speculative decoding framework that addresses the high latency bottleneck in existing model-based approaches such as EAGLE3, which rely on multi-step autoregressive draft generation. DART introduces, for the first time, the concept of diffusion language models into speculative decoding by enabling non-autoregressive draft generation through a single forward pass that predicts logits for multiple future positions in parallel. The method constructs high-quality draft token trees by integrating multi-position masked modeling conditioned on the target model's hidden states with an N-gram-constrained semantic continuity tree-pruning algorithm. Experimental results demonstrate that DART achieves end-to-end speedups of 2.03×–3.44× across multiple benchmarks, outperforming EAGLE3 by an average of 30% and significantly enhancing large language model inference efficiency.
📝 Abstract
Speculative decoding is an effective and lossless approach for accelerating LLM inference. However, existing widely adopted model-based draft designs, such as EAGLE3, improve accuracy at the cost of multi-step autoregressive inference, resulting in high drafting latency and ultimately rendering the drafting stage itself a performance bottleneck. Inspired by diffusion-based large language models (dLLMs), we propose DART, which leverages parallel generation to reduce drafting latency. DART predicts logits for multiple future masked positions in parallel within a single forward pass based on hidden states of the target model, thereby eliminating autoregressive rollouts in the draft model while preserving a lightweight design. Based on these parallel logit predictions, we further introduce an efficient tree pruning algorithm that constructs high-quality draft token trees with N-gram-enforced semantic continuity. DART substantially reduces draft-stage overhead while preserving high draft accuracy, leading to significantly improved end-to-end decoding speed. Experimental results demonstrate that DART achieves a 2.03×–3.44× wall-clock time speedup across multiple datasets, surpassing EAGLE3 by 30% on average and offering a practical speculative decoding framework. Code is released at https://github.com/fvliang/DART.
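To make the two core ideas concrete, here is a minimal toy sketch in NumPy: a draft head that emits logits for K future masked positions from a single matrix contraction (no autoregressive rollout), and a bigram-based continuity filter over candidate draft paths. All names (`draft_parallel`, `ngram_prune`, the per-position weight tensor `W`) and the use of bigrams rather than general N-grams are illustrative assumptions, not DART's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, HIDDEN, K = 50, 16, 3  # toy vocab size, hidden dim, draft depth

# Hypothetical draft head: one weight matrix per future position, so a
# single contraction yields logits for all K masked slots at once.
W = rng.normal(size=(K, HIDDEN, VOCAB))

def draft_parallel(h):
    """Predict logits for K future positions in ONE forward pass,
    conditioned on the target model's last hidden state h (toy stand-in
    for DART's non-autoregressive drafting)."""
    return np.einsum("d,kdv->kv", h, W)  # shape (K, VOCAB)

def ngram_prune(paths, context, n=2):
    """Keep only draft paths whose every n-gram (here: bigram) also
    occurs in the context -- a toy semantic-continuity constraint."""
    ctx = set(zip(context, context[1:]))
    return [p for p in paths if all(bg in ctx for bg in zip(p, p[1:]))]

h = rng.normal(size=HIDDEN)              # stand-in for a hidden state
logits = draft_parallel(h)               # (K, VOCAB) in one pass
tree_levels = np.argsort(logits, axis=1)[:, -2:]  # top-2 per position

context = [1, 2, 3, 2, 3, 4]
paths = [[3, 2, 3], [3, 5, 6]]
kept = ngram_prune(paths, context)       # [3,5,6] has bigrams absent
print(logits.shape, tree_levels.shape, kept)
```

Top candidates per position define the levels of a small draft token tree; the continuity filter then discards root-to-leaf paths that break local N-gram coherence before the target model verifies the survivors.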