Diffusion Beats Autoregressive in Data-Constrained Settings

📅 2025-07-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the data efficiency and performance advantages of diffusion language models over autoregressive (AR) models under data-constrained regimes—characterized by scarce training samples and repeated data reuse. We propose a masked diffusion modeling framework and systematically analyze the interplay among token ordering, prediction objectives, and training dynamics. Our analysis reveals an implicit data augmentation effect inherent to diffusion models, enabling more efficient reuse of limited training instances. We derive a theoretical critical computational budget threshold—expressed in closed form—beyond which diffusion models provably outperform AR counterparts. Empirical results confirm that, under computation-sufficient but data-scarce conditions, diffusion models achieve significantly lower validation loss and superior downstream task performance compared to AR models; gains are further amplified with multi-epoch training on the same dataset.

📝 Abstract
Autoregressive (AR) models have long dominated the landscape of large language models, driving progress across a wide range of tasks. Recently, diffusion-based language models have emerged as a promising alternative, though their advantages over AR models remain underexplored. In this paper, we systematically study masked diffusion models in data-constrained settings, where training involves repeated passes over limited data, and find that they significantly outperform AR models when compute is abundant but data is scarce. Diffusion models make better use of repeated data, achieving lower validation loss and superior downstream performance. We interpret this advantage as implicit data augmentation: masked diffusion exposes the model to a diverse distribution of token orderings and prediction tasks, unlike AR's fixed left-to-right factorization. We find new scaling laws for diffusion models and derive a closed-form expression for the critical compute threshold at which diffusion begins to outperform AR. These results suggest that when data, not compute, is the bottleneck, diffusion models offer a compelling alternative to the standard AR paradigm. Our code is available at: https://diffusion-scaling.github.io.
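The "implicit data augmentation" claim can be illustrated with a toy sketch (not the paper's implementation): an AR model sees the same fixed left-to-right prediction tasks on every pass over a sequence, while masked diffusion samples a fresh mask ratio and mask pattern each step, so repeated epochs over the same data yield different prediction tasks. The function names and the mask-ratio range below are illustrative assumptions.

```python
import random

def ar_prediction_tasks(tokens):
    # AR: fixed left-to-right factorization.
    # The set of (context, target) tasks is identical on every epoch.
    return [(tuple(tokens[:i]), tokens[i]) for i in range(len(tokens))]

def masked_diffusion_tasks(tokens, rng):
    # Masked diffusion (sketch): sample a masking ratio, then a random mask,
    # and predict each masked token from the visible remainder.
    ratio = rng.uniform(0.1, 0.9)  # illustrative range, not from the paper
    masked = {i for i in range(len(tokens)) if rng.random() < ratio}
    visible = tuple(t if i not in masked else "[MASK]"
                    for i, t in enumerate(tokens))
    return [(visible, tokens[i]) for i in sorted(masked)]

tokens = ["the", "cat", "sat", "on", "the", "mat"]
rng = random.Random(0)

# Two passes over the SAME sequence produce different diffusion tasks,
# while the AR tasks never change -- the "implicit augmentation" effect.
epoch1 = masked_diffusion_tasks(tokens, rng)
epoch2 = masked_diffusion_tasks(tokens, rng)
```

Here each epoch's re-masking plays the role that explicit augmentation plays in vision: the model is trained on a different conditional prediction problem over the same underlying tokens.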
Problem

Research questions and friction points this paper is trying to address.

Compares diffusion and autoregressive models in data-constrained settings
Explores diffusion models' superior performance with limited data
Identifies compute threshold for diffusion outperforming autoregressive models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Masked diffusion models outperform AR models
Implicit data augmentation via diverse token orderings
New scaling laws for diffusion models derived