Scaling Diffusion Language Models via Adaptation from Autoregressive Models

📅 2024-10-23
🏛️ arXiv.org
📈 Citations: 4
Influential: 1
📄 PDF
🤖 AI Summary
Diffusion language models (DLMs) face critical challenges, including the infeasibility of from-scratch training at scale, the absence of fair benchmarking protocols, and scalability limitations. Method: This work introduces a continual fine-tuning paradigm that leverages pretrained autoregressive (AR) models (e.g., GPT-2, LLaMA), establishing for the first time a theoretical connection between AR and diffusion modeling objectives. Key technical innovations include objective-function alignment with pretraining, discrete diffusion process modeling, denoising schedule optimization, and multi-stage token-level noise injection. Contribution/Results: The authors efficiently construct DLMs spanning 127M to 7B parameters. The released DiffuGPT/DiffuLLaMA series significantly outperforms prior DLMs on language modeling, reasoning, and commonsense tasks, matching same-scale AR baselines while requiring fewer than 200B tokens for adaptation. These models support fill-in-the-blank generation, in-context learning, and instruction following.
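The masked (absorbing-state) discrete diffusion objective that the summary alludes to can be sketched as below. This is a minimal illustrative sketch, not the paper's exact formulation: the `MASK_ID` choice, the toy vocabulary size, and the `1/t` loss reweighting (common in discrete-diffusion ELBO derivations) are all assumptions.

```python
import numpy as np

MASK_ID = 0   # assumed id of the absorbing [MASK] token
VOCAB = 16    # toy vocabulary size for illustration

def add_noise(tokens, t, rng):
    """Absorbing-state forward process: each token is independently
    replaced by MASK_ID with probability t (the noise level)."""
    mask = rng.random(tokens.shape) < t
    return np.where(mask, MASK_ID, tokens), mask

def diffusion_loss(logits, targets, mask, t):
    """Cross-entropy on masked positions only, reweighted by 1/t.
    logits: (seq_len, VOCAB); targets: (seq_len,); mask: bool (seq_len,)."""
    log_probs = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    return (nll * mask).sum() / max(mask.sum(), 1) / t

# Toy usage: at t=1.0 every token is masked; uniform logits give loss log(VOCAB).
rng = np.random.default_rng(0)
tokens = np.array([3, 5, 7, 9])
noised, mask = add_noise(tokens, 1.0, rng)
loss = diffusion_loss(np.zeros((4, VOCAB)), tokens, mask, 1.0)
```

In the adaptation setting described by the paper, the denoiser producing `logits` would be the pretrained AR transformer fine-tuned with bidirectional attention; here the model is elided and only the noising and loss pieces are shown.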

📝 Abstract
Diffusion Language Models (DLMs) have emerged as a promising new paradigm for text generative modeling, potentially addressing limitations of autoregressive (AR) models. However, current DLMs have been studied at a smaller scale compared to their AR counterparts and lack fair comparison on language modeling benchmarks. Additionally, training diffusion models from scratch at scale remains challenging. Given the prevalence of open-source AR language models, we propose adapting these models to build text diffusion models. We demonstrate connections between AR and diffusion modeling objectives and introduce a simple continual pre-training approach for training diffusion models. Through systematic evaluation on language modeling, reasoning, and commonsense benchmarks, we show that we can convert AR models ranging from 127M to 7B parameters (GPT2 and LLaMA) into diffusion models DiffuGPT and DiffuLLaMA, using less than 200B tokens for training. Our experimental results reveal that these models outperform earlier DLMs and are competitive with their AR counterparts. We release a suite of DLMs (with 127M, 355M, and 7B parameters) capable of generating fluent text, performing in-context learning, filling in the middle without prompt re-ordering, and following instructions (https://github.com/HKUNLP/DiffuLLaMA).
Problem

Research questions and friction points this paper is trying to address.

How can pretrained autoregressive models be adapted to build diffusion language models?
How can the scale and from-scratch training challenges of diffusion models be overcome?
How do diffusion models compare with autoregressive models under fair benchmarking?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptation of pretrained AR models into DLMs via a demonstrated connection between the two objectives
Simple continual pre-training approach requiring fewer than 200B tokens
Conversion of GPT2 and LLaMA into the DiffuGPT and DiffuLLaMA model series