Scaling Beyond Masked Diffusion Language Models

📅 2026-02-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work argues that existing research over-relies on masked diffusion language models and on perplexity as the primary evaluation metric, overlooking alternative discrete diffusion approaches that may offer better generation efficiency and practical applicability. The study systematically investigates the scaling behavior of uniform-state and interpolating discrete diffusion language models, establishing the first scaling law for non-masked diffusion architectures and exposing the limitations of perplexity for cross-model comparison. Introducing a speed-quality Pareto frontier as an evaluation framework and a simple cross-entropy training objective, the authors train and compare models at the 1.7B-parameter scale. Experiments show that uniform-state diffusion outperforms both autoregressive and masked diffusion models on GSM8K, while the cross-entropy-trained masked diffusion variant is approximately 12% more FLOPs-efficient, together demonstrating that diverse diffusion methodologies remain competitive.
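The speed-quality Pareto frontier used for evaluation can be sketched as a simple dominance filter over per-model measurements. The axes here (tokens per second, benchmark accuracy) are illustrative assumptions, not the paper's exact metrics:

```python
def pareto_frontier(points):
    """Keep the points not dominated on both axes.

    points: list of (tokens_per_sec, quality) pairs; higher is better on both.
    A point is dominated if some other point is at least as good on both axes.
    """
    frontier = [
        p for p in points
        if not any(q[0] >= p[0] and q[1] >= p[1] and q != p for q in points)
    ]
    return sorted(frontier)

# Hypothetical measurements: (tokens/sec, accuracy) for four model variants.
print(pareto_frontier([(10, 0.5), (20, 0.4), (15, 0.6), (5, 0.55)]))
# → [(15, 0.6), (20, 0.4)]
```

A model with worse perplexity can still sit on this frontier if its sampler is fast enough, which is the paper's argument against perplexity-only comparisons across diffusion families.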

📝 Abstract
Diffusion language models are a promising alternative to autoregressive models due to their potential for faster generation. Among discrete diffusion approaches, Masked diffusion currently dominates, largely driven by strong perplexity on language modeling benchmarks. In this work, we present the first scaling law study of uniform-state and interpolating discrete diffusion methods. We also show that Masked diffusion models can be made approximately 12% more FLOPs-efficient when trained with a simple cross-entropy objective. We find that perplexity is informative within a diffusion family but can be misleading across families, where models with worse likelihood scaling may be preferable due to faster and more practical sampling, as reflected by the speed-quality Pareto frontier. These results challenge the view that Masked diffusion is categorically the future of diffusion language modeling and that perplexity alone suffices for cross-algorithm comparison. Scaling all methods to 1.7B parameters, we show that uniform-state diffusion remains competitive on likelihood-based benchmarks and outperforms autoregressive and Masked diffusion models on GSM8K, despite worse validation perplexity. We provide the code, model checkpoints, and video tutorials on the project page: http://s-sahoo.github.io/scaling-dllms
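The abstract's "simple cross-entropy objective" for masked diffusion is not spelled out on this page; as a hedged illustration, a cross-entropy evaluated only at masked positions (a common form in masked-diffusion training, with all names here chosen for the sketch) might look like:

```python
import numpy as np

def masked_ce_loss(logits, tokens, mask):
    """Cross-entropy averaged over masked positions only.

    logits: (T, V) array of model outputs per position.
    tokens: (T,) int array of ground-truth token ids.
    mask:   (T,) bool array, True where the token was masked out
            and must be reconstructed by the model.
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    logp = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Negative log-likelihood of the true token at each position.
    nll = -logp[np.arange(len(tokens)), tokens]
    # Only masked positions contribute to the loss.
    return nll[mask].mean()
```

With uniform logits over a vocabulary of size V, the loss reduces to log V, a quick sanity check for an implementation like this.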
Problem

Research questions and friction points this paper is trying to address.

diffusion language models
masked diffusion
perplexity
scaling laws
sampling efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

discrete diffusion
scaling laws
perplexity limitations
FLOPs efficiency
Pareto frontier