Understanding and Accelerating the Training of Masked Diffusion Language Models

📅 2026-05-13
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Masked diffusion language models (MDMs) struggle to scale to larger architectures due to slow training convergence. This work identifies the locality bias inherent in natural language as a key factor underlying this inefficiency and introduces a bell-shaped time sampling strategy to better align the training dynamics with the model's learning process. The proposed approach significantly accelerates convergence while preserving final model performance. On the LM1B benchmark, it achieves approximately a 4× speedup in training time and demonstrates consistently faster improvements across multiple evaluation metrics, including generation perplexity, zero-shot generalization, and downstream task performance.
πŸ“ Abstract
Masked diffusion models (MDMs) have emerged as a promising alternative to autoregressive models (ARMs) for language modeling. However, MDMs are known to learn substantially more slowly than ARMs, which may become problematic when scaling MDMs to larger models. Therefore, we ask the following question: how can we accelerate standard MDM training while maintaining its final performance? To this end, we first provide a detailed analysis of why MDM training is slow. We find that the main factor is the locality bias of language: the predictive information for a token is concentrated in nearby positions. We further investigate how this bias slows learning and suggest a simple yet effective remedy: bell-shaped time sampling as a training strategy. Notably, MDMs trained with our training recipe reach the same validation negative log-likelihood (NLL) up to $\sim4\times$ faster than standard training on the One Billion Word Benchmark (LM1B). We also show faster improvements in generative perplexity, zero-shot perplexity, and downstream task performance on various benchmarks.
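To make "bell-shaped time sampling" concrete, the sketch below shows one way such a sampler could look. It is an illustration under stated assumptions, not the paper's recipe: the abstract does not name the distribution, so a symmetric Beta(alpha, alpha) is assumed here, along with a linear schedule in which each token is masked independently with probability t. The names `sample_times`, `mask_tokens`, `MASK_ID`, and the value `alpha=3.0` are hypothetical.

```python
import numpy as np

MASK_ID = -1  # hypothetical id for the [MASK] token (assumption)


def sample_times(batch_size, rng, alpha=3.0):
    """Draw one diffusion time t in (0, 1) per sequence.

    Assumption: a symmetric Beta(alpha, alpha) stands in for the
    "bell-shaped" distribution. alpha = 1 recovers uniform sampling;
    alpha > 1 concentrates t around 0.5.
    """
    return rng.beta(alpha, alpha, size=batch_size)


def mask_tokens(tokens, t, rng):
    """Mask each token independently with probability t (assumed linear schedule)."""
    mask = rng.random(tokens.shape) < t[:, None]
    noisy = np.where(mask, MASK_ID, tokens)
    return noisy, mask


rng = np.random.default_rng(0)
tokens = rng.integers(0, 1000, size=(4, 16))  # toy batch: 4 sequences of 16 token ids
t = sample_times(batch_size=4, rng=rng)
noisy, mask = mask_tokens(tokens, t, rng)
```

Compared with uniform sampling of t, this draws fewer batches that are almost fully masked or almost fully unmasked. Whether the loss is reweighted to compensate for the changed time distribution is not stated in the abstract, so the sketch leaves the loss out.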
Problem

Research questions and friction points this paper is trying to address.

Masked Diffusion Models
Training Acceleration
Language Modeling
Locality Bias
Learning Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

masked diffusion models
locality bias
bell-shaped time sampling
training acceleration
language modeling