🤖 AI Summary
This work challenges the prevailing assumption that autoregressive models (ARMs) are the sole viable paradigm for large language models (LLMs), investigating diffusion models as a principled alternative. We introduce LLaDA—the first large language diffusion model trained from scratch—built upon a standard Transformer architecture. It models text distributions via forward masking and reverse denoising, optimized under a variational lower bound, and follows a two-stage pretraining–supervised fine-tuning (SFT) pipeline. A key contribution is overcoming the "reversal curse": we empirically demonstrate, for the first time, that diffusion-based LLMs can match or exceed ARMs in core language capabilities. Specifically, LLaDA-8B achieves in-context learning performance on par with LLaMA3-8B; after SFT, it exhibits strong instruction following and multi-turn dialogue proficiency; and on reversal poem completion, it significantly outperforms both GPT-4o and an in-house ARM baseline.
📝 Abstract
Autoregressive models (ARMs) are widely regarded as the cornerstone of large language models (LLMs). We challenge this notion by introducing LLaDA, a diffusion model trained from scratch under the pre-training and supervised fine-tuning (SFT) paradigm. LLaDA models distributions through a forward data masking process and a reverse process, parameterized by a vanilla Transformer to predict masked tokens. By optimizing a likelihood bound, it provides a principled generative approach for probabilistic inference. Across extensive benchmarks, LLaDA demonstrates strong scalability, outperforming our self-constructed ARM baselines. Remarkably, LLaDA 8B is competitive with strong LLMs like LLaMA3 8B in in-context learning and, after SFT, exhibits impressive instruction-following abilities in case studies such as multi-turn dialogue. Moreover, LLaDA addresses the reversal curse, surpassing GPT-4o in a reversal poem completion task. Our findings establish diffusion models as a viable and promising alternative to ARMs, challenging the assumption that key LLM capabilities discussed above are inherently tied to ARMs.
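The forward masking and masked-token prediction described above can be sketched in miniature. The following is a minimal, hedged illustration (not the authors' implementation): it assumes the standard masked-diffusion recipe in which a masking ratio `t` is sampled, each token is masked independently with probability `t`, and the loss is the cross-entropy on masked positions weighted by `1/t`, yielding a Monte Carlo estimate of the negative likelihood bound. The `MASK` id and the way `log_probs` is supplied are placeholders for the Transformer's actual vocabulary and outputs.

```python
import math
import random

MASK = -1  # hypothetical mask-token id; in practice a reserved vocab index


def forward_mask(tokens, t, rng):
    """Forward process: mask each token independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def masked_diffusion_loss(tokens, masked, log_probs, t):
    """Monte Carlo estimate of the negative variational bound:
    cross-entropy over masked positions only, weighted by 1/t.

    log_probs[i][v] stands in for the mask predictor's log-probability
    of token v at position i (normally produced by the Transformer).
    """
    total = 0.0
    for pos, (tok, m) in enumerate(zip(tokens, masked)):
        if m == MASK:
            total -= log_probs[pos][tok]
    return total / (t * len(tokens))


# Toy usage: t = 1.0 masks everything; with a uniform predictor over a
# vocabulary of size V, the per-token loss reduces to log(V).
rng = random.Random(0)
tokens = [0, 1]
masked = forward_mask(tokens, 1.0, rng)
V = 4
log_probs = [[math.log(1.0 / V)] * V for _ in tokens]
loss = masked_diffusion_loss(tokens, masked, log_probs, 1.0)
```

In training, `t` is drawn afresh per sequence (e.g. uniformly on (0, 1]), so the expectation of this weighted loss bounds the negative log-likelihood; at `t = 1` the objective coincides with the standard masked-LM cross-entropy over all positions.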