DAWN: Dependency-Aware Fast Inference for Diffusion LLMs

📅 2026-02-06
🤖 AI Summary
Existing parallel decoding methods for diffusion-based large language models often suffer from significant quality degradation due to neglecting semantic dependencies among tokens, making it challenging to balance speed and generation fidelity. This work proposes DAWN, a training-free, dependency-aware decoding framework that explicitly models semantic dependencies among tokens in diffusion LLMs for the first time. By constructing a dependency graph based on these relationships, DAWN dynamically selects high-confidence positions for mask removal during decoding. This approach breaks the independence assumption inherent in conventional parallel decoding strategies, substantially improving generation quality while preserving high parallelism. Experimental results demonstrate that DAWN achieves 1.80–8.06× inference speedup across multiple models and datasets with negligible loss in generation quality.

📝 Abstract
Diffusion large language models (dLLMs) have shown advantages in text generation, particularly due to their inherent ability for parallel decoding. However, constrained by the quality–speed trade-off, existing inference solutions adopt conservative parallel strategies, leaving substantial efficiency potential underexplored. A core challenge is that parallel decoding assumes each position can be filled independently, but tokens are often semantically coupled: the correct choice at one position constrains the valid choices at others. Without modeling these inter-token dependencies, parallel strategies produce deteriorated outputs. Motivated by this insight, we propose DAWN, a training-free, dependency-aware decoding method for fast dLLM inference. DAWN extracts token dependencies and builds on two key observations: (1) positions that depend on already-unmasked, high-certainty positions become more reliable; (2) simultaneously unmasking strongly coupled, uncertain positions induces errors. Given these findings, DAWN leverages a dependency graph to select more reliable unmasking positions at each iteration, achieving high parallelism with negligible loss in generation quality. Extensive experiments across multiple models and datasets demonstrate that DAWN speeds up inference by 1.80–8.06× over baselines while preserving generation quality. Code is released at https://github.com/lizhuo-luo/DAWN.
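To make the selection rule concrete, here is a minimal sketch of one dependency-aware unmasking step in the spirit the abstract describes. This is an illustration, not the authors' implementation: the function name, the confidence threshold `tau`, the coupling threshold `coupling_thresh`, and the representation of the dependency graph as a dict of pairwise weights are all assumptions made for this example.

```python
# Illustrative sketch of one dependency-aware selection step (not DAWN's
# actual code). Assumes: per-position confidences for the top token, a set
# of still-masked positions, and a dependency graph given as pairwise
# coupling weights between positions.

def select_unmask_positions(confidence, masked, deps,
                            tau=0.9, coupling_thresh=0.5):
    """Pick positions to unmask in this decoding iteration.

    confidence: dict pos -> model confidence for its top token
    masked: set of still-masked positions
    deps: dict (i, j) with i < j -> coupling weight between positions
    tau: confidence threshold for a position to be an unmask candidate
    coupling_thresh: weight above which two candidates are considered
        strongly coupled and must not be unmasked in the same step
    """
    # Candidates: masked positions confident enough to unmask,
    # highest confidence first.
    candidates = sorted(
        (p for p in masked if confidence[p] >= tau),
        key=lambda p: confidence[p],
        reverse=True,
    )
    selected = []
    for p in candidates:
        # Skip p if it is strongly coupled to an already-selected
        # candidate: committing both simultaneously risks an
        # inconsistent joint choice.
        coupled = any(
            deps.get((min(p, q), max(p, q)), 0.0) >= coupling_thresh
            for q in selected
        )
        if not coupled:
            selected.append(p)
    # Fallback: always unmask at least the single most confident
    # position so decoding makes progress.
    if not selected and masked:
        selected = [max(masked, key=lambda p: confidence[p])]
    return selected
```

For example, if positions 0 and 1 are both confident but strongly coupled (weight 0.8), only the more confident one is unmasked this step; the other waits until its dependency is resolved, at which point (per the abstract's first observation) its prediction becomes more reliable.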
Problem

Research questions and friction points this paper is trying to address.

diffusion LLMs
parallel decoding
token dependencies
inference efficiency
generation quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

dependency-aware decoding
diffusion LLMs
parallel decoding
token dependency graph
training-free acceleration