🤖 AI Summary
The performance advantages of diffusion language models (DLMs) under data-constrained regimes remain unclear. Method: This work systematically studies DLM training dynamics and generalization when unique tokens are limited, attributing the gains to a triple-gain mechanism: (i) arbitrary-order sequence modeling, (ii) high-density computation via iterative bidirectional denoising, and (iii) intrinsic Monte Carlo sampling augmentation, which together overcome data-efficiency bottlenecks. Training repeats standard pre-training data for many epochs under the usual diffusion objective; input and parameter noise injection are evaluated as controls for the autoregressive (AR) baselines. Results: A 1.7B-parameter DLM surpasses a same-scale AR model using only 10B unique Python tokens; a 1B-parameter DLM reaches 56.2% accuracy on HellaSwag and 33.7% on MMLU with just 1B unique tokens, substantially outperforming AR baselines trained on comparable data volumes. This study provides the first empirical validation of DLMs' sustained competitive advantage in the small-data, large-model regime.
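To make the third factor concrete, below is a minimal PyTorch sketch of one masked-diffusion training step (names and constants such as `MASK_ID` are our own assumptions, not the paper's code). Because the mask ratio and mask pattern are resampled on every step, repeated epochs over the same unique tokens still present fresh prediction targets, which is the Monte Carlo augmentation effect described above.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # assumed id of the [MASK] token (hypothetical)

def diffusion_lm_step(model, tokens):
    """One masked-diffusion training step (sketch).

    tokens: (batch, seq_len) LongTensor of token ids.
    model:  bidirectional transformer mapping ids -> (batch, seq_len, vocab) logits.
    """
    b, n = tokens.shape
    # Sample a corruption level t ~ U(0, 1) per sequence;
    # clamp so at least a few positions get masked.
    t = torch.rand(b, 1, device=tokens.device).clamp_(min=0.05)
    # Mask each position independently with probability t (fresh every step).
    masked = torch.rand(b, n, device=tokens.device) < t
    inp = torch.where(masked, torch.full_like(tokens, MASK_ID), tokens)
    # Bidirectional denoiser predicts the original token at every position.
    logits = model(inp)  # (b, n, vocab)
    # Loss only on masked positions; random mask patterns give any-order training.
    return F.cross_entropy(logits[masked], tokens[masked])
```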
📝 Abstract
Under strictly controlled pre-training settings, we observe a crossover: when unique data is limited, diffusion language models (DLMs) consistently surpass autoregressive (AR) models by training for more epochs. The crossover shifts later with more or higher-quality data, shifts earlier with larger models, and persists across dense and sparse architectures. We attribute the gains to three compounding factors: (1) any-order modeling, (2) super-dense compute from iterative bidirectional denoising, and (3) built-in Monte Carlo augmentation; injecting input or parameter noise improves AR models under data constraint but cannot close the gap. At scale, a 1.7B DLM trained with a ~1.5T-token compute budget on 10B unique Python tokens overtakes an AR coder trained under strictly matched settings. In addition, a 1B-parameter DLM achieves >56% accuracy on HellaSwag and >33% on MMLU using only 1B tokens, without any special tricks, simply by repeating standard pre-training data. We also show that rising validation cross-entropy does not imply degraded downstream performance in this regime.
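For contrast, the two AR noise controls mentioned above can be sketched as follows; the function names and default strengths are illustrative assumptions, not the paper's exact recipe.

```python
import torch

def corrupt_inputs(tokens, vocab_size, p=0.05):
    """Input noise: replace each token with a random one with probability p.

    Applied to AR training inputs only (targets stay clean). Hypothetical
    default p; the paper's setting may differ.
    """
    noise = torch.randint_like(tokens, vocab_size)
    keep = torch.rand(tokens.shape, device=tokens.device) >= p
    return torch.where(keep, tokens, noise)

def add_parameter_noise(model, sigma=0.01):
    """Parameter noise: add zero-mean Gaussian noise to every weight in place.

    Meant to be called during training steps only; sigma is an assumed value.
    """
    with torch.no_grad():
        for param in model.parameters():
            param.add_(torch.randn_like(param) * sigma)
```

Both controls perturb the AR training signal but, unlike diffusion's resampled masks, do not change the fixed left-to-right factorization, which is consistent with the abstract's finding that they narrow but cannot close the gap.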