🤖 AI Summary
The performance advantages of diffusion language models (DLMs) under data-constrained regimes remain unclear. Method: This work systematically studies DLM training dynamics and generalization when unique tokens are limited, attributing the gains to a triple-gain mechanism: (i) arbitrary-order sequence modeling, (ii) high-density computation via iterative bidirectional denoising, and (iii) intrinsic Monte Carlo sampling augmentation, which together overcome data-efficiency bottlenecks. Training repeats standard pre-training data for many epochs under the usual diffusion objective; input and parameter noise injection are evaluated as controls for the autoregressive (AR) baselines. Results: A 1.7B-parameter DLM surpasses a same-scale AR model using only 10B unique Python tokens; a 1B-parameter DLM reaches 56.2% accuracy on HellaSwag and 33.7% on MMLU with just 1B unique tokens, substantially outperforming AR baselines trained on comparable data volumes. This study provides the first empirical validation of DLMs' sustained competitive advantage in the small-data, large-model regime.
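To make the third factor concrete, below is a minimal PyTorch sketch of one masked-diffusion training step (names and constants such as `MASK_ID` are our own assumptions, not the paper's code). Because the mask ratio and mask pattern are resampled on every step, repeated epochs over the same unique tokens still present fresh prediction targets, which is the Monte Carlo augmentation effect described above.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # assumed id of the [MASK] token (hypothetical)

def diffusion_lm_step(model, tokens):
    """One masked-diffusion training step (sketch).

    tokens: (batch, seq_len) LongTensor of token ids.
    model:  bidirectional transformer mapping ids -> (batch, seq_len, vocab) logits.
    """
    b, n = tokens.shape
    # Sample a corruption level t ~ U(0, 1) per sequence;
    # clamp so at least a few positions get masked.
    t = torch.rand(b, 1, device=tokens.device).clamp_(min=0.05)
    # Mask each position independently with probability t (fresh every step).
    masked = torch.rand(b, n, device=tokens.device) < t
    inp = torch.where(masked, torch.full_like(tokens, MASK_ID), tokens)
    # Bidirectional denoiser predicts the original token at every position.
    logits = model(inp)  # (b, n, vocab)
    # Loss only on masked positions; random mask patterns give any-order training.
    return F.cross_entropy(logits[masked], tokens[masked])
```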
📝 Abstract
Under strictly controlled pre-training settings, we observe a crossover: when unique data is limited, diffusion language models (DLMs) consistently surpass autoregressive (AR) models by training for more epochs. The crossover shifts later with more or higher-quality data, shifts earlier with larger models, and persists across dense and sparse architectures. We attribute the gains to three compounding factors: (1) any-order modeling, (2) super-dense compute from iterative bidirectional denoising, and (3) built-in Monte Carlo augmentation; injecting input or parameter noise improves AR models under data constraint but cannot close the gap. At scale, a 1.7B DLM trained with a ~1.5T-token compute budget on 10B unique Python tokens overtakes an AR coder trained under strictly matched settings. In addition, a 1B-parameter DLM achieves >56% accuracy on HellaSwag and >33% on MMLU using only 1B tokens, without any special tricks, simply by repeating standard pre-training data. We also show that rising validation cross-entropy does not imply degraded downstream performance in this regime.
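For contrast, the two AR noise controls mentioned above can be sketched as follows; the function names and default strengths are illustrative assumptions, not the paper's exact recipe.

```python
import torch

def corrupt_inputs(tokens, vocab_size, p=0.05):
    """Input noise: replace each token with a random one with probability p.

    Applied to AR training inputs only (targets stay clean). Hypothetical
    default p; the paper's setting may differ.
    """
    noise = torch.randint_like(tokens, vocab_size)
    keep = torch.rand(tokens.shape, device=tokens.device) >= p
    return torch.where(keep, tokens, noise)

def add_parameter_noise(model, sigma=0.01):
    """Parameter noise: add zero-mean Gaussian noise to every weight in place.

    Meant to be called during training steps only; sigma is an assumed value.
    """
    with torch.no_grad():
        for param in model.parameters():
            param.add_(torch.randn_like(param) * sigma)
```

Both controls perturb the AR training signal but, unlike diffusion's resampled masks, do not change the fixed left-to-right factorization, which is consistent with the abstract's finding that they narrow but cannot close the gap.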