AI Summary
Masked diffusion language models (MDMs) struggle to scale to larger architectures due to slow training convergence. This work identifies the locality bias inherent in natural language as a key factor underlying this inefficiency and introduces a bell-shaped time sampling strategy to better align the training dynamics with the model's learning process. The proposed approach significantly accelerates convergence while preserving final model performance. On the LM1B benchmark, it achieves approximately a 4× speedup in training time and demonstrates consistently faster improvements across multiple evaluation metrics, including generation perplexity, zero-shot generalization, and downstream task performance.
Abstract
Masked diffusion models (MDMs) have emerged as a promising alternative to autoregressive models (ARMs) for language modeling. However, MDMs are known to learn substantially more slowly than ARMs, which may become problematic when scaling MDMs to larger models. Therefore, we ask the following question: how can we accelerate standard MDM training while maintaining its final performance? To this end, we first provide a detailed analysis of why MDM training is slow. We find that the main factor is the locality bias of language: the predictive information for a token is concentrated in nearby positions. We further investigate how this bias slows learning and suggest a simple yet effective remedy: bell-shaped time sampling as a training strategy. Notably, MDMs trained with our training recipe reach the same validation negative log-likelihood (NLL) up to $\sim4\times$ faster than standard training on the One Billion Word Benchmark (LM1B). We also show faster improvements in generative perplexity, zero-shot perplexity, and downstream task performance on various benchmarks.
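To make the idea of bell-shaped time sampling concrete, the sketch below contrasts it with uniform time sampling in a toy masking step. The abstract does not state the exact distribution, so everything specific here is an assumption for illustration: the symmetric Beta distribution, the concentration value of 3.0, the linear masking schedule (each token masked with probability $t$), and the names `sample_time_uniform`, `sample_time_bell`, `mask_tokens`, and `MASK`. It shows only how the training time $t$ might be drawn, not the loss or any reweighting the method may use.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, not taken from the paper


def sample_time_uniform(rng):
    """Baseline assumed here: draw the diffusion time t from Uniform(0, 1)."""
    return rng.random()


def sample_time_bell(rng, concentration=3.0):
    """Bell-shaped alternative: draw t from Beta(c, c), symmetric around 0.5.

    The Beta family and c = 3.0 are illustrative assumptions; any
    distribution that concentrates mass at intermediate t would fit
    the description "bell-shaped".
    """
    return rng.betavariate(concentration, concentration)


def mask_tokens(tokens, t, rng):
    """Mask each token independently with probability t (linear schedule assumed)."""
    return [MASK if rng.random() < t else tok for tok in tokens]


rng = random.Random(0)
tokens = "the cat sat on the mat".split()

# One toy training step: sample a time, then corrupt the sequence.
t = sample_time_bell(rng)
noisy = mask_tokens(tokens, t, rng)
print(f"t = {t:.3f}", noisy)
```

Under these assumptions, intermediate masking ratios are sampled more often than under the uniform baseline, while nearly clean ($t \approx 0$) and nearly fully masked ($t \approx 1$) sequences are sampled less often.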