🤖 AI Summary
This work challenges the prevailing assumption that autoregressive models (ARMs) are the sole viable paradigm for large language models (LLMs), investigating diffusion models as a principled alternative. We introduce LLaDA—the first large language diffusion model trained from scratch—built upon a standard Transformer architecture. It models text distributions via forward masking and reverse denoising, optimized under a variational lower bound, and follows a two-stage pretraining–supervised fine-tuning (SFT) pipeline. A key contribution is overcoming the "reversal curse": we empirically demonstrate, for the first time, that diffusion-based LLMs can match or exceed ARMs in core language capabilities. Specifically, LLaDA-8B achieves in-context learning performance on par with LLaMA3-8B; after SFT, it exhibits strong instruction following and multi-turn dialogue proficiency; and on reversal poem completion, it significantly outperforms both GPT-4o and an in-house ARM baseline.
📝 Abstract
Autoregressive models (ARMs) are widely regarded as the cornerstone of large language models (LLMs). We challenge this notion by introducing LLaDA, a diffusion model trained from scratch under the pre-training and supervised fine-tuning (SFT) paradigm. LLaDA models distributions through a forward data masking process and a reverse process, parameterized by a vanilla Transformer to predict masked tokens. By optimizing a likelihood bound, it provides a principled generative approach for probabilistic inference. Across extensive benchmarks, LLaDA demonstrates strong scalability, outperforming our self-constructed ARM baselines. Remarkably, LLaDA 8B is competitive with strong LLMs like LLaMA3 8B in in-context learning and, after SFT, exhibits impressive instruction-following abilities in case studies such as multi-turn dialogue. Moreover, LLaDA addresses the reversal curse, surpassing GPT-4o in a reversal poem completion task. Our findings establish diffusion models as a viable and promising alternative to ARMs, challenging the assumption that key LLM capabilities discussed above are inherently tied to ARMs.
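The forward masking and masked-token prediction described above can be sketched in miniature. The following is a minimal, hedged illustration (not the authors' implementation): it assumes the standard masked-diffusion recipe in which a masking ratio `t` is sampled, each token is masked independently with probability `t`, and the loss is the cross-entropy on masked positions weighted by `1/t`, yielding a Monte Carlo estimate of the negative likelihood bound. The `MASK` id and the way `log_probs` is supplied are placeholders for the Transformer's actual vocabulary and outputs.

```python
import math
import random

MASK = -1  # hypothetical mask-token id; in practice a reserved vocab index


def forward_mask(tokens, t, rng):
    """Forward process: mask each token independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def masked_diffusion_loss(tokens, masked, log_probs, t):
    """Monte Carlo estimate of the negative variational bound:
    cross-entropy over masked positions only, weighted by 1/t.

    log_probs[i][v] stands in for the mask predictor's log-probability
    of token v at position i (normally produced by the Transformer).
    """
    total = 0.0
    for pos, (tok, m) in enumerate(zip(tokens, masked)):
        if m == MASK:
            total -= log_probs[pos][tok]
    return total / (t * len(tokens))


# Toy usage: t = 1.0 masks everything; with a uniform predictor over a
# vocabulary of size V, the per-token loss reduces to log(V).
rng = random.Random(0)
tokens = [0, 1]
masked = forward_mask(tokens, 1.0, rng)
V = 4
log_probs = [[math.log(1.0 / V)] * V for _ in tokens]
loss = masked_diffusion_loss(tokens, masked, log_probs, 1.0)
```

In training, `t` is drawn afresh per sequence (e.g. uniformly on (0, 1]), so the expectation of this weighted loss bounds the negative log-likelihood; at `t = 1` the objective coincides with the standard masked-LM cross-entropy over all positions.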