🤖 AI Summary
Scaling laws for discrete diffusion language models (DLMs) remain poorly understood: DLMs exhibit markedly different data, compute, and parameter dependencies than autoregressive language models (ALMs).
Method: We systematically investigate how noise type affects DLM scaling behavior, introducing a noise-interpolated diffusion mechanism and a unified framework for co-optimizing compute, data volume, and batch size. We conduct the largest FLOPs-driven scaling study to date, spanning models of up to 10B parameters and 10²² total FLOPs.
Contribution/Results: We establish the principle that “noise type governs DLM scaling characteristics,” with uniform noise showing superior training efficiency under data constraints. Uniform-diffusion models achieve loss comparable to masked-diffusion models in the compute-bound regime while requiring less data and more parameters for compute-efficient training. This finding addresses a key bottleneck in DLM scalability and provides foundational insight for principled DLM design and scaling.
📝 Abstract
Modern LLM pre-training consumes vast amounts of compute and training data, making the scaling behavior, or scaling laws, of different models a key distinguishing factor. Discrete diffusion language models (DLMs) have been proposed as an alternative to autoregressive language models (ALMs). However, their scaling behavior has not yet been fully explored, with prior work suggesting that they require more data and compute to match the performance of ALMs.
We study the scaling behavior of DLMs under different noise types by smoothly interpolating between masked and uniform diffusion while paying close attention to crucial hyperparameters such as batch size and learning rate. Our experiments reveal that the scaling behavior of DLMs strongly depends on the noise type and is considerably different from that of ALMs. While all noise types converge to similar loss values in compute-bound scaling, we find that uniform diffusion requires more parameters and less data for compute-efficient training compared to masked diffusion, making it a promising candidate in data-bound settings. We scale our uniform diffusion model up to 10B parameters trained for $10^{22}$ FLOPs, confirming the predicted scaling behavior and making it the largest publicly known uniform diffusion model to date.
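To make the masked/uniform interpolation concrete, here is a minimal, hypothetical sketch of a forward-noising step. It is not the paper's exact parameterization: it assumes a single interpolation weight `lam` where each corrupted token becomes a mask token with probability `lam` or a uniformly random vocabulary token otherwise, so `lam=1` recovers masked diffusion and `lam=0` recovers uniform diffusion. The names `MASK_ID` and `corrupt` are illustrative.

```python
import random

MASK_ID = 0  # hypothetical mask-token id; regular tokens assumed to be 1..vocab_size


def corrupt(tokens, t, lam, vocab_size, rng=random):
    """Illustrative forward-noising step interpolating masked and uniform diffusion.

    Each token is independently corrupted with probability t (the noise level).
    A corrupted token becomes MASK_ID with probability lam, or a uniformly
    random vocabulary token with probability 1 - lam.
    """
    noised = []
    for tok in tokens:
        if rng.random() < t:
            if rng.random() < lam:
                noised.append(MASK_ID)  # masked-diffusion corruption
            else:
                noised.append(rng.randrange(1, vocab_size + 1))  # uniform corruption
        else:
            noised.append(tok)  # token survives this noise level
    return noised
```

At `t=0` the sequence is untouched; at `t=1` every position is corrupted, and `lam` alone decides the mix of mask tokens versus random tokens.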