🤖 AI Summary
Why do diffusion language models exhibit high data efficiency in low-data regimes? Through systematic ablation studies, this paper identifies the underlying mechanism: random masking of input tokens is the primary driver of data efficiency, and other stochastic regularization techniques, such as MLP dropout and weight decay, yield comparable gains. The authors propose a multi-phase training analysis framework to quantitatively disentangle the contributions of individual components, establishing for the first time that stochastic regularization is the key unifying factor enabling efficient learning under data scarcity. These findings provide a reproducible, mechanistic explanation for the strong low-data performance of diffusion language models and offer empirical guidance for their design. All code, experimental configurations, and detailed ablation results are publicly released to ensure full transparency and reproducibility.
📝 Abstract
Recent studies have shown that diffusion language models achieve remarkable data efficiency under limited-data constraints, yet the underlying mechanisms remain unclear. In this work, we perform extensive ablation experiments to disentangle the sources of this efficiency. Our results show that random masking of input tokens plays the dominant role. We further show that similar gains can be obtained through MLP dropout and weight decay, indicating that stochastic regularization broadly enhances data efficiency in multi-epoch training. Our code is available at https://github.com/zitian-gao/data-efficiency.
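To make the masking mechanism concrete, here is a minimal sketch (not the paper's implementation; `MASK_ID` and the masking rate are illustrative assumptions) of the kind of random input-token masking the abstract refers to: because the mask is re-sampled on every pass, multi-epoch training never shows the model the exact same input twice, which is the stochastic-regularization effect being studied.

```python
import random

MASK_ID = 0  # hypothetical id standing in for the model's [MASK] token


def random_mask(tokens, mask_prob, rng):
    """Replace each token with MASK_ID independently with probability
    mask_prob. Re-sampled each epoch, so the model sees a fresh corrupted
    view of the same sequence on every pass."""
    return [MASK_ID if rng.random() < mask_prob else t for t in tokens]


rng = random.Random(42)
tokens = [5, 8, 13, 21, 34, 55]
# Three "epochs" over the same sequence produce three different views.
epoch_views = [random_mask(tokens, 0.5, rng) for _ in range(3)]
```

In this reading, random masking plays a role analogous to dropout applied at the input: both inject per-example noise that changes across epochs, which is consistent with the paper's finding that MLP dropout and weight decay can recover similar gains.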