Understanding and Accelerating the Training of Masked Diffusion Language Models

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Masked diffusion language models (MDMs) converge more slowly in training than autoregressive models, which may hinder scaling to larger models. This work identifies the locality bias of natural language, in which the predictive information for a token is concentrated in nearby positions, as the main cause of this inefficiency, and proposes bell-shaped time sampling as a training strategy to counter it. The approach accelerates convergence while maintaining final performance. On the LM1B benchmark, it reaches the same validation negative log-likelihood up to approximately 4× faster than standard training, and it also shows faster improvements in generative perplexity, zero-shot perplexity, and downstream task performance.

📝 Abstract

Masked diffusion models (MDMs) have emerged as a promising alternative to autoregressive models (ARMs) for language modeling. However, MDMs are known to learn substantially more slowly than ARMs, which may become problematic when scaling MDMs to larger models. Therefore, we ask the following question: how can we accelerate standard MDM training while maintaining its final performance? To this end, we first provide a detailed analysis of why MDM training is slow. We find that the main factor is the locality bias of language: the predictive information for a token is concentrated in nearby positions. We further investigate how this bias slows learning and suggest a simple yet effective remedy: bell-shaped time sampling as a training strategy. Notably, MDMs trained with our training recipe reach the same validation negative log-likelihood (NLL) up to $\sim4\times$ faster than standard training on the One Billion Word Benchmark (LM1B). We also show faster improvements in generative perplexity, zero-shot perplexity, and downstream task performance on various benchmarks.
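
To make the idea of bell-shaped time sampling concrete, here is a minimal sketch of how a training step might draw its masking time. It is an illustration under stated assumptions, not the paper's recipe: the abstract does not specify the exact density, so a symmetric Beta(c, c) distribution with c > 1 stands in for "bell-shaped"; the uniform baseline and the rule "mask each token independently with probability t" are assumed as the standard setup; and `MASK_ID`, `sample_times`, and `mask_tokens` are hypothetical names. Any loss reweighting that a change of time distribution might require is left out.

```python
import numpy as np

MASK_ID = -1  # hypothetical mask-token id, chosen for illustration only


def sample_times(batch_size, rng, bell_shaped=True, concentration=2.0):
    """Draw one masking time t in [0, 1] per sequence.

    bell_shaped=False: assumed uniform baseline, t ~ U(0, 1).
    bell_shaped=True: t ~ Beta(c, c) with c > 1, an assumed stand-in for a
    bell-shaped density that peaks at t = 0.5.
    """
    if bell_shaped:
        return rng.beta(concentration, concentration, size=batch_size)
    return rng.uniform(0.0, 1.0, size=batch_size)


def mask_tokens(tokens, t, rng):
    """Mask each token independently with probability t (one t per sequence)."""
    is_masked = rng.random(tokens.shape) < t[:, None]
    return np.where(is_masked, MASK_ID, tokens), is_masked


rng = np.random.default_rng(0)
tokens = rng.integers(0, 1000, size=(4, 16))  # toy batch of token ids
t = sample_times(4, rng)                      # bell-shaped times
noisy, is_masked = mask_tokens(tokens, t, rng)
```

Compared with uniform sampling, the Beta draw places fewer training steps at very low and very high masking ratios and more at intermediate ones; whether that matches the paper's exact choice of density is an assumption of this sketch.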

Problem

Research questions and friction points this paper is trying to address.

Masked Diffusion Models

Training Acceleration

Language Modeling

Locality Bias

Learning Efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Masked Diffusion Models

Locality Bias

Bell-Shaped Time Sampling