🤖 AI Summary
Scaling laws for discrete diffusion language models (DLMs) remain poorly understood: DLMs exhibit markedly different data, compute, and parameter dependencies than autoregressive language models (ALMs).
Method: We systematically investigate how noise type affects DLM scaling behavior, introducing a noise-interpolated diffusion mechanism and a unified framework for co-optimizing compute, data volume, and batch size. We conduct the largest FLOPs-driven scaling study to date, spanning models of up to 10B parameters and 10²² total FLOPs.
Contribution/Results: We establish the principle that “noise type governs DLM scaling characteristics,” with uniform noise showing superior training efficiency under data constraints. Uniform-diffusion models achieve loss comparable to masked-diffusion models in the compute-bound regime while requiring less data and more parameters for compute-efficient training. This finding addresses a key bottleneck in DLM scalability and provides foundational insight for principled DLM design and scaling.
📝 Abstract
Modern LLM pre-training consumes vast amounts of compute and training data, making the scaling behavior, or scaling laws, of different models a key distinguishing factor. Discrete diffusion language models (DLMs) have been proposed as an alternative to autoregressive language models (ALMs). However, their scaling behavior has not yet been fully explored, with prior work suggesting that they require more data and compute to match the performance of ALMs.
We study the scaling behavior of DLMs under different noise types by smoothly interpolating between masked and uniform diffusion while paying close attention to crucial hyperparameters such as batch size and learning rate. Our experiments reveal that the scaling behavior of DLMs strongly depends on the noise type and is considerably different from that of ALMs. While all noise types converge to similar loss values in compute-bound scaling, we find that uniform diffusion requires more parameters and less data for compute-efficient training compared to masked diffusion, making it a promising candidate in data-bound settings. We scale our uniform diffusion model up to 10B parameters trained for $10^{22}$ FLOPs, confirming the predicted scaling behavior and making it the largest publicly known uniform diffusion model to date.
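To make the masked/uniform interpolation concrete, here is a minimal, hypothetical sketch of a forward-noising step. It is not the paper's exact parameterization: it assumes a single interpolation weight `lam` where each corrupted token becomes a mask token with probability `lam` or a uniformly random vocabulary token otherwise, so `lam=1` recovers masked diffusion and `lam=0` recovers uniform diffusion. The names `MASK_ID` and `corrupt` are illustrative.

```python
import random

MASK_ID = 0  # hypothetical mask-token id; regular tokens assumed to be 1..vocab_size


def corrupt(tokens, t, lam, vocab_size, rng=random):
    """Illustrative forward-noising step interpolating masked and uniform diffusion.

    Each token is independently corrupted with probability t (the noise level).
    A corrupted token becomes MASK_ID with probability lam, or a uniformly
    random vocabulary token with probability 1 - lam.
    """
    noised = []
    for tok in tokens:
        if rng.random() < t:
            if rng.random() < lam:
                noised.append(MASK_ID)  # masked-diffusion corruption
            else:
                noised.append(rng.randrange(1, vocab_size + 1))  # uniform corruption
        else:
            noised.append(tok)  # token survives this noise level
    return noised
```

At `t=0` the sequence is untouched; at `t=1` every position is corrupted, and `lam` alone decides the mix of mask tokens versus random tokens.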