Diffusion Beats Autoregressive in Data-Constrained Settings

📅 2025-07-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the data efficiency and performance advantages of diffusion language models over autoregressive (AR) models under data-constrained regimes—characterized by scarce training samples and repeated data reuse. We propose a masked diffusion modeling framework and systematically analyze the interplay among token ordering, prediction objectives, and training dynamics. Our analysis reveals an implicit data augmentation effect inherent to diffusion models, enabling more efficient reuse of limited training instances. We derive a theoretical critical computational budget threshold—expressed in closed form—beyond which diffusion models provably outperform AR counterparts. Empirical results confirm that, under computation-sufficient but data-scarce conditions, diffusion models achieve significantly lower validation loss and superior downstream task performance compared to AR models; gains are further amplified with multi-epoch training on the same dataset.

📝 Abstract
Autoregressive (AR) models have long dominated the landscape of large language models, driving progress across a wide range of tasks. Recently, diffusion-based language models have emerged as a promising alternative, though their advantages over AR models remain underexplored. In this paper, we systematically study masked diffusion models in data-constrained settings, where training involves repeated passes over limited data, and find that they significantly outperform AR models when compute is abundant but data is scarce. Diffusion models make better use of repeated data, achieving lower validation loss and superior downstream performance. We interpret this advantage as implicit data augmentation: masked diffusion exposes the model to a diverse distribution of token orderings and prediction tasks, unlike AR's fixed left-to-right factorization. We find new scaling laws for diffusion models and derive a closed-form expression for the critical compute threshold at which diffusion begins to outperform AR. These results suggest that when data, not compute, is the bottleneck, diffusion models offer a compelling alternative to the standard AR paradigm. Our code is available at: https://diffusion-scaling.github.io.
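The "implicit data augmentation" claim can be illustrated with a toy sketch (not the paper's implementation): an AR model sees the same fixed left-to-right prediction tasks on every pass over a sequence, while masked diffusion samples a fresh mask ratio and mask pattern each step, so repeated epochs over the same data yield different prediction tasks. The function names and the mask-ratio range below are illustrative assumptions.

```python
import random

def ar_prediction_tasks(tokens):
    # AR: fixed left-to-right factorization.
    # The set of (context, target) tasks is identical on every epoch.
    return [(tuple(tokens[:i]), tokens[i]) for i in range(len(tokens))]

def masked_diffusion_tasks(tokens, rng):
    # Masked diffusion (sketch): sample a masking ratio, then a random mask,
    # and predict each masked token from the visible remainder.
    ratio = rng.uniform(0.1, 0.9)  # illustrative range, not from the paper
    masked = {i for i in range(len(tokens)) if rng.random() < ratio}
    visible = tuple(t if i not in masked else "[MASK]"
                    for i, t in enumerate(tokens))
    return [(visible, tokens[i]) for i in sorted(masked)]

tokens = ["the", "cat", "sat", "on", "the", "mat"]
rng = random.Random(0)

# Two passes over the SAME sequence produce different diffusion tasks,
# while the AR tasks never change -- the "implicit augmentation" effect.
epoch1 = masked_diffusion_tasks(tokens, rng)
epoch2 = masked_diffusion_tasks(tokens, rng)
```

Here each epoch's re-masking plays the role that explicit augmentation plays in vision: the model is trained on a different conditional prediction problem over the same underlying tokens.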
Problem

Research questions and friction points this paper is trying to address.

Compares diffusion and autoregressive models in data-constrained settings
Explores diffusion models' superior performance with limited data
Identifies compute threshold for diffusion outperforming autoregressive models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Masked diffusion models outperform AR models
Implicit data augmentation via diverse token orderings
New scaling laws for diffusion models derived