Cluster-Level Attention-Guided Parallel Decoding for Masked Diffusion Language Models

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing training-free samplers for masked diffusion language models choose which positions to commit at token-level granularity, overlooking the tendency of high-confidence predictions to emerge as contiguous spans and thereby limiting parallel efficiency. This work proposes CLAD (Cluster-Level Attention-Guided Decoding), a decoding strategy that first groups adjacent high-confidence candidates into contiguous spans, termed confidence-induced clusters (CICs), and then uses self-attention maps from the same forward pass to estimate inter-cluster dependencies, enabling conflict-aware parallel commitment of mutually compatible clusters. CLAD is training-free: it requires no modification to model training. Evaluated on the LLaDA and Dream model families across four reasoning and code-generation benchmarks, it achieves speedups of 1.77× to 8.47× over Vanilla decoding while maintaining broadly comparable task accuracy in most settings.
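The clustering step in this summary can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `confidence_clusters`, the fixed threshold of 0.9, and the rule that a cluster is a maximal run of still-masked positions above that threshold are all assumptions made for illustration.

```python
from typing import List, Tuple


def confidence_clusters(
    confidences: List[float],
    masked: List[bool],
    threshold: float = 0.9,  # assumed cutoff; the source does not state one
) -> List[Tuple[int, int]]:
    """Group adjacent masked positions with confidence >= threshold.

    Returns inclusive (start, end) index pairs, one per contiguous span.
    """
    if len(confidences) != len(masked):
        raise ValueError("confidences and masked must have the same length")

    clusters: List[Tuple[int, int]] = []
    start = None  # start index of the span currently being built, if any
    for i, (conf, is_masked) in enumerate(zip(confidences, masked)):
        if is_masked and conf >= threshold:
            if start is None:
                start = i  # open a new span
        elif start is not None:
            clusters.append((start, i - 1))  # close the span before position i
            start = None
    if start is not None:
        clusters.append((start, len(confidences) - 1))  # span runs to the end
    return clusters


# Hypothetical per-position confidences for one denoising step.
example_conf = [0.95, 0.97, 0.40, 0.92, 0.99, 0.93, 0.30]
# Position 5 is already committed, so it cannot join a cluster.
example_masked = [True, True, True, True, True, False, True]
example_clusters = confidence_clusters(example_conf, example_masked)
```

In this toy input, the low-confidence position 2 and the already-committed position 5 both act as span boundaries, so the candidates split into two clusters rather than one.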

📝 Abstract

Masked diffusion language models (MDLMs) enable parallel decoding by predicting all masked positions at each denoising step, yet existing training-free samplers usually decide which positions to commit at token-level granularity. We revisit this granularity and observe that reliable predictions often emerge as contiguous high-confidence spans, suggesting that the unit of parallel commitment can be larger than a single token. We first group adjacent high-confidence candidates into confidence-induced clusters (CICs) as span-level update units. We then use self-attention maps from the same forward pass to estimate inter-cluster dependencies, enabling conflict-aware selection of mutually compatible CICs for parallel commitment. This yields CLAD (Cluster-Level Attention-Guided Decoding), a training-free cluster-level decoder for MDLMs. Experiments on LLaDA and Dream model families across four reasoning and code-generation benchmarks show that CLAD achieves 1.77× to 8.47× speedups over Vanilla decoding while maintaining broadly comparable task accuracy in most settings.
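The conflict-aware selection step can be sketched as follows. The abstract does not specify how dependencies are scored or how compatible clusters are chosen, so everything here is an assumption for illustration: the dependency score (the maximum attention weight between two clusters, in either direction), the greedy selection order (highest cluster score first), the threshold `tau`, and the function names are all hypothetical. The attention matrix is assumed to be already aggregated over heads and layers.

```python
from typing import List, Sequence, Tuple

Cluster = Tuple[int, int]  # inclusive (start, end) token positions


def cluster_dependency(
    attn: Sequence[Sequence[float]], a: Cluster, b: Cluster
) -> float:
    """Assumed dependency score: largest attention weight between a and b.

    attn[q][k] is the weight with which query position q attends to key k.
    """
    weights = []
    for i in range(a[0], a[1] + 1):
        for j in range(b[0], b[1] + 1):
            weights.append(attn[i][j])  # a attends to b
            weights.append(attn[j][i])  # b attends to a
    return max(weights)


def select_compatible(
    clusters: List[Cluster],
    scores: List[float],
    attn: Sequence[Sequence[float]],
    tau: float = 0.2,  # assumed conflict threshold
) -> List[Cluster]:
    """Greedily keep clusters whose dependency on every kept cluster <= tau."""
    order = sorted(range(len(clusters)), key=lambda k: scores[k], reverse=True)
    chosen: List[int] = []
    for k in order:
        if all(
            cluster_dependency(attn, clusters[k], clusters[c]) <= tau
            for c in chosen
        ):
            chosen.append(k)
    return sorted(clusters[k] for k in chosen)


# Toy 5-token example: uniform weak attention, except that position 2
# attends strongly to position 0.
toy_attn = [[0.05] * 5 for _ in range(5)]
toy_attn[2][0] = 0.6
toy_clusters = [(0, 1), (2, 2), (4, 4)]
toy_scores = [0.96, 0.93, 0.91]  # e.g. mean confidence per cluster
committed = select_compatible(toy_clusters, toy_scores, toy_attn)
```

Here the cluster at position 2 depends strongly on the higher-scoring cluster covering positions 0 to 1, so it is deferred to a later step, while the independent cluster at position 4 is committed in parallel with the first.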

Problem

Research questions and friction points this paper is trying to address.

masked diffusion language models

parallel decoding

token-level granularity

confidence spans

commitment units

Innovation

Methods, ideas, or system contributions that make the work stand out.

cluster-level decoding

masked diffusion language models

parallel decoding