Plan for Speed -- Dilated Scheduling for Masked Diffusion Language Models

📅 2025-06-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing masked diffusion language models (MDLMs) rely on token-wise confidence or entropy heuristics for parallel denoising, ignoring pairwise token dependencies; this limits inference efficiency and keeps latency comparable to autoregressive models. To address this, the paper proposes DUS (Dilated-scheduled Unmasking Strategy), an inference-only method that partitions sequence positions into dilation-based groups of non-adjacent tokens. Under a first-order Markov assumption, tokens within a group can be unmasked independently in parallel while keeping the joint entropy of each step low, and the original denoiser is left unchanged. DUS requires only O(log B) denoiser calls per block, significantly fewer than semi-autoregressive baselines. Experiments demonstrate that DUS simultaneously improves generation quality and inference speed on mathematical reasoning and code generation tasks.

📝 Abstract
Masked diffusion language models (MDLMs) have shown strong promise for non-autoregressive text generation, yet existing samplers act as implicit planners, selecting tokens to unmask via denoiser confidence or entropy scores. Such heuristics falter under parallel unmasking - they ignore pairwise interactions between tokens and cannot account for dependencies when unmasking multiple positions at once, limiting their inference speed to that of traditional auto-regressive (AR) models. We introduce the Dilated-scheduled Unmasking Strategy (DUS), an inference-only, planner-model-free method that requires no additional training. DUS leverages a first-order Markov assumption to partition sequence positions into dilation-based groups of non-adjacent tokens, enabling independent, parallel unmasking steps that respect local context while minimizing the joint entropy of each iteration step. Unlike semi-AR block approaches (e.g., LLaDA and Dream) that still invoke the denoiser per block, DUS reduces the number of denoiser calls to O(log B) per generation block - yielding substantial speedup over the O(B) run time of state-of-the-art diffusion models, where B is the block size in the semi-AR inference process. In experiments on math (GSM8K) and code completion (HumanEval, MBPP) benchmarks - domains suited to non-ordinal generation - DUS improves scores over parallel confidence-based planners, without modifying the underlying denoiser. DUS offers a lightweight, budget-aware approach to efficient, high-quality text generation, paving the way to unlock the true capabilities of MDLMs.
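The abstract's dilation-based grouping can be illustrated with a small sketch. The rule below (midpoint subdivision over a power-of-two block) is a hypothetical reconstruction, not necessarily the paper's exact scheme: it partitions the B positions of a block into log2(B)+1 groups of mutually non-adjacent tokens, which the first-order Markov assumption would allow to be unmasked in parallel.

```python
def dilated_schedule(block_size):
    """Partition positions 0..block_size-1 into log2(B)+1 groups.

    Each group contains mutually non-adjacent positions, so under a
    first-order Markov assumption the tokens in one group can be
    unmasked in a single parallel step. Illustrative rule only.
    """
    assert block_size > 0 and (block_size & (block_size - 1)) == 0, \
        "sketch assumes a power-of-two block size"
    groups = [[0]]          # seed step: unmask the first position alone
    stride = block_size
    while stride > 1:       # each pass halves the stride, doubling group size
        groups.append(list(range(stride // 2, block_size, stride)))
        stride //= 2
    return groups
```

For `block_size=8` this yields `[[0], [4], [2, 6], [1, 3, 5, 7]]`: four steps instead of eight, with every group's positions at least two apart.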
Problem

Research questions and friction points this paper is trying to address.

Improving parallel unmasking in masked diffusion language models
Reducing denoiser calls for faster non-autoregressive generation
Enhancing text generation quality without modifying the denoiser
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dilated-scheduled Unmasking Strategy for parallel decoding
First-order Markov assumption for non-adjacent token grouping
Reduces denoiser calls to O(log B) per block
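The O(log B) claim in the bullets above can be made concrete with a toy end-to-end loop. Everything here is hypothetical scaffolding: `toy_denoiser` stands in for the frozen MDLM, the midpoint-subdivision grouping is an assumed instance of a dilated schedule, and greedy argmax replaces the paper's confidence/entropy scoring; only the call-counting pattern matters.

```python
import random

MASK = -1  # hypothetical mask token id

def dilation_groups(block_size):
    """Assumed dilated schedule: midpoint subdivision yields
    log2(B)+1 groups of mutually non-adjacent positions."""
    groups, stride = [[0]], block_size
    while stride > 1:
        groups.append(list(range(stride // 2, block_size, stride)))
        stride //= 2
    return groups

def toy_denoiser(tokens, vocab_size):
    """Stand-in for the frozen MDLM denoiser: returns a random
    score distribution over the vocabulary at every position."""
    return [[random.random() for _ in range(vocab_size)] for _ in tokens]

def dus_decode_block(block_size=8, vocab_size=16):
    """Unmask one whole group per denoiser call, so a block of B
    tokens costs log2(B)+1 calls instead of B."""
    tokens = [MASK] * block_size
    calls = 0
    for group in dilation_groups(block_size):
        probs = toy_denoiser(tokens, vocab_size)  # one denoiser call per group
        calls += 1
        for pos in group:
            # positions in a group are non-adjacent, so they are
            # filled in parallel (greedy argmax in this sketch)
            tokens[pos] = max(range(vocab_size), key=lambda v: probs[pos][v])
    return tokens, calls
```

With `block_size=8` this makes 4 denoiser calls versus 8 for a one-token-per-call semi-AR sampler, and the gap grows as B/log2(B) for larger blocks.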