🤖 AI Summary
This work addresses language modeling under extreme data scarcity (<1M tokens) in the BabyLM 2025 Challenge. Method: We propose the Frequency-Aware Diffusion Language Model (FADLM), which features (i) a frequency-aware masking mechanism that preferentially masks and reconstructs low-frequency tokens to improve rare-word learning, and (ii) a dual-mode noise-scheduling scheme with dynamic noise weighting derived from the Negative Evidence Lower Bound (NELBO) to stabilize sequence modeling. Contribution/Results: Experiments show that FADLM matches hybrid autoregressive–masked baselines on the BabyLM benchmark. It is the first pure diffusion architecture empirically shown to acquire syntactic structure, world knowledge, and human-like linguistic distributions under ultra-low-resource conditions, establishing a new paradigm for data-constrained language modeling.
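The frequency-aware masking idea above can be illustrated with a minimal sketch. This is not the paper's implementation: the function name, the log-frequency rarity score, and the exponent `alpha` are all assumptions introduced here for illustration. The key property is that rarer tokens receive a higher masking probability at every diffusion step.

```python
import numpy as np

def frequency_aware_mask(token_ids, token_freqs, t, mask_id, alpha=1.0, rng=None):
    """Illustrative sketch (not the paper's code): mask tokens with a
    probability that grows with the diffusion time t and is biased toward
    low-frequency (rare) tokens.

    token_ids:   (batch, seq) integer array of token indices
    token_freqs: (vocab,) array of corpus relative frequencies
    t:           scalar in [0, 1]; the base masking rate at this step
    """
    rng = rng or np.random.default_rng(0)
    freqs = token_freqs[token_ids]          # per-position corpus frequency
    rarity = -np.log(freqs + 1e-9)          # rare tokens -> large score
    rarity = rarity / rarity.mean()         # normalize around 1
    p_mask = np.clip(t * rarity ** alpha, 0.0, 1.0)  # rarity-biased mask prob
    mask = rng.random(token_ids.shape) < p_mask
    noised = np.where(mask, mask_id, token_ids)
    return noised, mask
```

With a uniform frequency table the scheme reduces to ordinary uniform masking at rate `t`; the bias only appears once frequencies differ.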
📝 Abstract
We present a masked diffusion language modeling framework for data-efficient training in the BabyLM 2025 Challenge. Our approach applies diffusion training objectives to language modeling under strict data constraints, incorporating frequency-informed masking that prioritizes learning from rare tokens while preserving theoretical validity. We explore multiple noise-scheduling strategies, including dual-mode approaches, and investigate different noise-weighting schemes within the NELBO objective. We evaluate our method on the BabyLM benchmark suite, measuring linguistic competence, world knowledge, and human-likeness. Results show performance competitive with hybrid autoregressive–masked baselines, demonstrating that diffusion-based training is a viable alternative for data-restricted language learning.
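To make the role of noise weighting within the NELBO concrete, here is a minimal sketch assuming a linear noise schedule (alpha_t = 1 - t), under which the masked-diffusion NELBO weights the masked-token cross-entropy by 1/t. The function name and the specific 1/t weight are illustrative assumptions, not the weighting scheme the paper investigates.

```python
import numpy as np

def nelbo_weighted_loss(logits, targets, mask, t):
    """Illustrative sketch: cross-entropy on masked positions, weighted by
    1/t as in a masked-diffusion NELBO with a linear schedule alpha_t = 1 - t.

    logits:  (batch, seq, vocab) array of model outputs
    targets: (batch, seq) integer array of clean token indices
    mask:    (batch, seq) boolean array marking masked positions
    t:       scalar diffusion time in (0, 1]
    """
    # numerically stable log-softmax over the vocabulary axis
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # log-probability assigned to each target token
    tok_logp = np.take_along_axis(logp, targets[..., None], axis=-1)[..., 0]
    # mean cross-entropy over masked positions only
    ce = -(tok_logp * mask).sum() / max(int(mask.sum()), 1)
    return ce / t  # smaller t (less noise) -> larger per-token weight
```

Early timesteps (small `t`, few masked tokens) are weighted more heavily per token, which is the kind of trade-off a dynamic weighting scheme can rebalance.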