Dependency-Guided Parallel Decoding in Discrete Diffusion Language Models

📅 2026-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Discrete diffusion language models suffer from distributional mismatch and degraded generation quality in parallel decoding because inter-token dependencies are neglected. This work proposes DEMASK, which introduces the first lightweight dependency predictor to model conditional influences among masked positions, paired with a greedy algorithm that selects positions with bounded cumulative dependency for parallel decoding. Theoretical analysis shows that this strategy bounds the total variation distance between parallel sampling and the true joint distribution. Experiments on Dream-7B show that DEMASK achieves a 1.7–2.2× speedup while matching or outperforming confidence-based and KL-divergence-based baselines in generation accuracy.
📝 Abstract
Discrete diffusion language models (dLLMs) accelerate text generation by unmasking multiple tokens in parallel. However, parallel decoding introduces a distributional mismatch: it approximates the joint conditional using a fully factorized product of per-token marginals, which degrades output quality when selected tokens are strongly dependent. We propose DEMASK (DEpendency-guided unMASKing), a lightweight dependency predictor that attaches to the final hidden states of a dLLM. In a single forward pass, it estimates pairwise conditional influences between masked positions. Using these predictions, a greedy selection algorithm identifies positions with bounded cumulative dependency for simultaneous unmasking. Under a sub-additivity assumption, we prove this bounds the total variation distance between our parallel sampling and the model's joint distribution. Empirically, DEMASK achieves a 1.7–2.2× speedup on Dream-7B while matching or improving accuracy compared to confidence-based and KL-based baselines.
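The greedy selection step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `greedy_select`, the symmetric pairwise dependency matrix `dep`, the confidence-ordered scan, and the budget `tau` are all assumptions about how "bounded cumulative dependency" might be operationalized.

```python
import numpy as np

def greedy_select(confidence, dep, tau):
    """Pick masked positions for parallel unmasking (hypothetical sketch).

    confidence: (n,) per-position confidence scores (higher = safer to unmask).
    dep: (n, n) predicted pairwise dependency strengths, assumed symmetric
         with zero diagonal; dep[i, j] approximates how much jointly
         unmasking i and j deviates from sequential unmasking.
    tau: budget on the cumulative dependency within the selected set.
    """
    order = np.argsort(-confidence)           # scan most-confident first
    selected, total = [], 0.0
    for i in order:
        added = dep[i, selected].sum()        # dependency to already-chosen set
        if total + added <= tau:              # keep cumulative dependency bounded
            selected.append(int(i))
            total += added
    return selected

# Toy example: 4 masked positions; 0 and 1 are strongly dependent,
# so unmasking them in the same step is deferred.
conf = np.array([0.9, 0.8, 0.7, 0.6])
dep = np.array([
    [0.0, 0.5, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.1],
    [0.0, 0.0, 0.1, 0.0],
])
print(greedy_select(conf, dep, tau=0.2))  # → [0, 2, 3]; position 1 waits
```

Under the paper's sub-additivity assumption, keeping the selected set's cumulative dependency under `tau` is what yields the bound on total variation distance between the factorized parallel sample and the model's joint.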
Problem

Research questions and friction points this paper is trying to address.

discrete diffusion language models
parallel decoding
distributional mismatch
token dependency
joint conditional distribution
Innovation

Methods, ideas, or system contributions that make the work stand out.

discrete diffusion language models
parallel decoding
dependency prediction
DEMASK
total variation bound