🤖 AI Summary
This work addresses language modeling under extreme data scarcity (<1M tokens) in the BabyLM 2025 Challenge. Method: We propose the Frequency-Aware Diffusion Language Model (FADLM), which features (i) a frequency-aware masking mechanism that preferentially masks and reconstructs low-frequency tokens to improve rare-word learning, and (ii) a dual-mode noise-scheduling scheme with dynamic noise weighting derived from the Negative Evidence Lower Bound (NELBO) to stabilize sequence modeling. Contribution/Results: Experiments show that FADLM matches hybrid autoregressive–masked baselines on the BabyLM benchmark. It is the first pure diffusion architecture empirically shown to acquire syntactic structure, world knowledge, and human-like linguistic distributions under ultra-low-resource conditions, establishing a new paradigm for data-constrained language modeling.
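The frequency-aware masking idea above can be illustrated with a minimal sketch. This is not the paper's implementation: the function name, the log-frequency rarity score, and the exponent `alpha` are all assumptions introduced here for illustration. The key property is that rarer tokens receive a higher masking probability at every diffusion step.

```python
import numpy as np

def frequency_aware_mask(token_ids, token_freqs, t, mask_id, alpha=1.0, rng=None):
    """Illustrative sketch (not the paper's code): mask tokens with a
    probability that grows with the diffusion time t and is biased toward
    low-frequency (rare) tokens.

    token_ids:   (batch, seq) integer array of token indices
    token_freqs: (vocab,) array of corpus relative frequencies
    t:           scalar in [0, 1]; the base masking rate at this step
    """
    rng = rng or np.random.default_rng(0)
    freqs = token_freqs[token_ids]          # per-position corpus frequency
    rarity = -np.log(freqs + 1e-9)          # rare tokens -> large score
    rarity = rarity / rarity.mean()         # normalize around 1
    p_mask = np.clip(t * rarity ** alpha, 0.0, 1.0)  # rarity-biased mask prob
    mask = rng.random(token_ids.shape) < p_mask
    noised = np.where(mask, mask_id, token_ids)
    return noised, mask
```

With a uniform frequency table the scheme reduces to ordinary uniform masking at rate `t`; the bias only appears once frequencies differ.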
📝 Abstract
We present a masked diffusion language modeling framework for data-efficient training in the BabyLM 2025 Challenge. Our approach applies diffusion training objectives to language modeling under strict data constraints, incorporating frequency-informed masking that prioritizes learning from rare tokens while preserving theoretical validity. We explore multiple noise-scheduling strategies, including dual-mode approaches, and investigate different noise-weighting schemes within the NELBO objective. We evaluate our method on the BabyLM benchmark suite, measuring linguistic competence, world knowledge, and human-likeness. Results show performance competitive with hybrid autoregressive–masked baselines, demonstrating that diffusion-based training is a viable alternative for data-restricted language learning.
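To make the role of noise weighting within the NELBO concrete, here is a minimal sketch assuming a linear noise schedule (alpha_t = 1 - t), under which the masked-diffusion NELBO weights the masked-token cross-entropy by 1/t. The function name and the specific 1/t weight are illustrative assumptions, not the weighting scheme the paper investigates.

```python
import numpy as np

def nelbo_weighted_loss(logits, targets, mask, t):
    """Illustrative sketch: cross-entropy on masked positions, weighted by
    1/t as in a masked-diffusion NELBO with a linear schedule alpha_t = 1 - t.

    logits:  (batch, seq, vocab) array of model outputs
    targets: (batch, seq) integer array of clean token indices
    mask:    (batch, seq) boolean array marking masked positions
    t:       scalar diffusion time in (0, 1]
    """
    # numerically stable log-softmax over the vocabulary axis
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # log-probability assigned to each target token
    tok_logp = np.take_along_axis(logp, targets[..., None], axis=-1)[..., 0]
    # mean cross-entropy over masked positions only
    ce = -(tok_logp * mask).sum() / max(int(mask.sum()), 1)
    return ce / t  # smaller t (less noise) -> larger per-token weight
```

Early timesteps (small `t`, few masked tokens) are weighted more heavily per token, which is the kind of trade-off a dynamic weighting scheme can rebalance.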