🤖 AI Summary
Existing training-free samplers for masked diffusion language models (MDLMs) decide which positions to commit at token-level granularity, overlooking the tendency of high-confidence predictions to emerge as contiguous spans and thereby limiting parallelism. This work proposes CLAD (Cluster-Level Attention-Guided Decoding), a decoding strategy that first groups adjacent high-confidence candidates into Confidence-Induced Clusters (CICs) and then uses self-attention maps from the same forward pass to estimate inter-cluster dependencies, enabling conflict-aware parallel commitment of mutually compatible clusters. CLAD is training-free and requires no changes to the underlying model. Evaluated on the LLaDA and Dream model families across four reasoning and code-generation benchmarks, it achieves speedups of 1.77× to 8.47× over vanilla decoding while keeping task accuracy broadly comparable in most settings.
📝 Abstract
Masked diffusion language models (MDLMs) enable parallel decoding by predicting all masked positions at each denoising step, yet existing training-free samplers usually decide which positions to commit at token-level granularity. We revisit this granularity and observe that reliable predictions often emerge as contiguous high-confidence spans, suggesting that the unit of parallel commitment can be larger than a single token. We first group adjacent high-confidence candidates into confidence-induced clusters (CICs) as span-level update units. We then use self-attention maps from the same forward pass to estimate inter-cluster dependencies, enabling conflict-aware selection of mutually compatible CICs for parallel commitment. This yields CLAD (Cluster-Level Attention-Guided Decoding), a training-free cluster-level decoder for MDLMs. Experiments on LLaDA and Dream model families across four reasoning and code-generation benchmarks show that CLAD achieves 1.77× to 8.47× speedups over Vanilla decoding while maintaining broadly comparable task accuracy in most settings.
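The two stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the confidence threshold `tau`, the dependency threshold `delta`, the mean-attention dependency score, and the greedy selection order are all assumptions made for illustration, as are the function names.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # half-open position range [start, end)


def confidence_clusters(conf: Sequence[float], masked: Sequence[bool],
                        tau: float = 0.9) -> List[Span]:
    """Group adjacent masked positions with confidence >= tau into spans.

    `tau` is a hypothetical threshold, not a value taken from the paper.
    """
    clusters: List[Span] = []
    start = None
    for i, (c, m) in enumerate(zip(conf, masked)):
        if m and c >= tau:
            if start is None:
                start = i  # open a new span
        elif start is not None:
            clusters.append((start, i))  # close the current span
            start = None
    if start is not None:
        clusters.append((start, len(conf)))
    return clusters


def mean_attention(attn: Sequence[Sequence[float]], rows: Span, cols: Span) -> float:
    """Average attention weight from positions in `rows` to positions in `cols`."""
    total = sum(attn[i][j] for i in range(*rows) for j in range(*cols))
    return total / ((rows[1] - rows[0]) * (cols[1] - cols[0]))


def cluster_dependency(attn: Sequence[Sequence[float]], a: Span, b: Span) -> float:
    """Assumed dependency score: the larger of the two directed mean attentions."""
    return max(mean_attention(attn, a, b), mean_attention(attn, b, a))


def select_compatible(clusters: List[Span], conf: Sequence[float],
                      attn: Sequence[Sequence[float]], delta: float = 0.1) -> List[Span]:
    """Greedily keep clusters whose dependency on every kept cluster is below delta.

    Clusters are visited in order of decreasing mean confidence (an assumption).
    """
    def mean_conf(span: Span) -> float:
        return sum(conf[span[0]:span[1]]) / (span[1] - span[0])

    chosen: List[Span] = []
    for cand in sorted(clusters, key=mean_conf, reverse=True):
        if all(cluster_dependency(attn, cand, kept) < delta for kept in chosen):
            chosen.append(cand)
    return sorted(chosen)


# Toy example: six masked positions, two high-confidence spans.
conf = [0.95, 0.97, 0.2, 0.92, 0.99, 0.3]
masked = [True] * 6
clusters = confidence_clusters(conf, masked)  # [(0, 2), (3, 5)]

# Attention in which the second span attends strongly to the first.
coupled = [[0.0] * 6 for _ in range(6)]
for i in (3, 4):
    for j in (0, 1):
        coupled[i][j] = 0.5

committed = select_compatible(clusters, conf, coupled)  # [(0, 2)]
```

In the toy example the two spans are strongly coupled through attention, so only the more confident one is committed in this step; with no cross-span attention, both would be committed in parallel. A real decoder would take `conf` and `attn` from the model's forward pass and would need to choose how to aggregate attention across heads and layers, which this sketch leaves out.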