On the Trainability of Masked Diffusion Language Models via Blockwise Locality

📅 2026-04-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Standard masked diffusion models (MDMs) with random masking can train unstably on structured reasoning tasks. This work proposes a blockwise locality mechanism that enforces left-to-right autoregressive order within each block while preserving iterative refinement at the block level, combining autoregressive local generation with diffusion-style global planning. Two architectures, Jigsaw and Scatter, are introduced and evaluated on three controlled tasks: in-context linear regression, graph path-finding, and Sudoku solving. Jigsaw matches the training stability of autoregressive models on linear regression and remains strong on Sudoku, while Scatter retains the planning advantage of diffusion models on graph path-finding.

📝 Abstract

Masked diffusion language models (MDMs) have recently emerged as a promising alternative to standard autoregressive large language models (AR-LLMs), yet their optimization can be substantially less stable. We study blockwise MDMs and compare them with AR-LLMs on three controlled tasks that stress different aspects of structured generation: in-context linear regression, graph path-finding, and Sudoku solving. We find that standard random-masking MDMs fail to reliably learn linear regression and exhibit high-variance training dynamics on graph path-finding, yet outperform AR-LLMs on Sudoku. To mitigate these instabilities, we propose two locality-aware blockwise models, Jigsaw and Scatter, that inject a left-to-right inductive bias by enforcing autoregressive locality within blocks while preserving iterative refinement at the block level. Empirically, Jigsaw matches AR-LLM stability on linear regression and remains strong on Sudoku, while Scatter retains diffusion's planning advantage on path-finding. Our results indicate that standard random-masking MDMs, even with blockwise variants, may be a suboptimal instantiation of diffusion LMs for ordered generation, motivating models beyond random masking.
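
To make the contrast between random masking and within-block left-to-right locality concrete, here is a toy sketch of two corruption patterns. It is not the paper's Jigsaw or Scatter design, whose details are not given here. The function names, the `block_size` and `mask_ratio` parameters, and the "mask a contiguous suffix of each selected block" rule are all assumptions chosen only for illustration.

```python
import random


def random_mask(seq_len, mask_ratio, rng):
    """Random-masking corruption: mask positions drawn uniformly at random."""
    k = round(seq_len * mask_ratio)
    masked = set(rng.sample(range(seq_len), k))
    return [i in masked for i in range(seq_len)]


def blockwise_suffix_mask(seq_len, block_size, mask_ratio, rng):
    """Hypothetical locality-aware corruption (an assumption, not the paper's method).

    Each block is selected with probability mask_ratio. Inside a selected
    block, a contiguous suffix is masked, so every visible token in the
    block lies to the left of every masked token in that block.
    """
    mask = [False] * seq_len
    for start in range(0, seq_len, block_size):
        end = min(start + block_size, seq_len)
        if rng.random() < mask_ratio:
            cut = rng.randrange(start, end)  # first masked position in the block
            for i in range(cut, end):
                mask[i] = True
    return mask


def show(mask, block_size):
    """Render a mask as text: '_' is a masked slot, 'x' is a visible token."""
    cells = ["_" if m else "x" for m in mask]
    blocks = [cells[i:i + block_size] for i in range(0, len(cells), block_size)]
    return " | ".join("".join(b) for b in blocks)


if __name__ == "__main__":
    rng = random.Random(0)
    print(show(random_mask(16, 0.5, rng), 4))
    print(show(blockwise_suffix_mask(16, 4, 0.5, rng), 4))
```

In the first pattern a model may have to predict a token whose left neighbors are also missing. In the second, the masked tokens of a block can be filled in left to right from a fully visible prefix, while whole blocks can still be revisited in any order.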

Problem

Research questions and friction points this paper is trying to address.

masked diffusion language models

training instability

structured generation

autoregressive locality

ordered generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

masked diffusion language models

blockwise locality

inductive bias