🤖 AI Summary
Why do diffusion language models exhibit high data efficiency in low-data regimes? Through systematic ablation studies, this paper identifies the underlying mechanism: random masking of input tokens is the primary driver of data efficiency, and other stochastic regularization techniques, such as MLP dropout and weight decay, yield comparable gains. The authors propose a multi-phase training analysis framework to quantitatively disentangle the contributions of individual components, establishing for the first time that stochastic regularization is the key unifying factor enabling efficient learning under data scarcity. These findings provide a reproducible, mechanistic explanation for the strong low-data performance of diffusion language models and offer empirical guidance for their design. All code, experimental configurations, and detailed ablation results are publicly released to ensure full transparency and reproducibility.
📝 Abstract
Recent studies have shown that diffusion language models achieve remarkable data efficiency under limited-data constraints, yet the underlying mechanisms remain unclear. In this work, we perform extensive ablation experiments to disentangle the sources of this efficiency. Our results show that random masking of input tokens plays the dominant role. We further show that similar gains can be obtained through MLP dropout and weight decay, indicating that stochastic regularization broadly enhances data efficiency in multi-epoch training. Our code is available at https://github.com/zitian-gao/data-efficiency.
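To make the masking mechanism concrete, here is a minimal sketch (not the paper's implementation; `MASK_ID` and the masking rate are illustrative assumptions) of the kind of random input-token masking the abstract refers to: because the mask is re-sampled on every pass, multi-epoch training never shows the model the exact same input twice, which is the stochastic-regularization effect being studied.

```python
import random

MASK_ID = 0  # hypothetical id standing in for the model's [MASK] token


def random_mask(tokens, mask_prob, rng):
    """Replace each token with MASK_ID independently with probability
    mask_prob. Re-sampled each epoch, so the model sees a fresh corrupted
    view of the same sequence on every pass."""
    return [MASK_ID if rng.random() < mask_prob else t for t in tokens]


rng = random.Random(42)
tokens = [5, 8, 13, 21, 34, 55]
# Three "epochs" over the same sequence produce three different views.
epoch_views = [random_mask(tokens, 0.5, rng) for _ in range(3)]
```

In this reading, random masking plays a role analogous to dropout applied at the input: both inject per-example noise that changes across epochs, which is consistent with the paper's finding that MLP dropout and weight decay can recover similar gains.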