Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models

📅 2025-03-12
🤖 AI Summary
Diffusion language models (DLMs) offer parallel decoding and controllability, but they lag behind autoregressive models in likelihood modeling and are restricted to fixed-length generation. To address these limitations, the paper proposes block diffusion language models: a discrete diffusion framework that incorporates autoregressive modeling across blocks, enabling variable-length text generation and efficient parallel sampling. The key contributions are: (1) a block-wise denoising paradigm that interpolates between autoregressive and diffusion models and supports arbitrary-length sequences; (2) data-driven noise schedules, paired with gradient-variance estimators, that substantially improve training stability; and (3) KV caching and parallel token sampling, which accelerate inference. On standard language modeling benchmarks, block diffusion sets a new state of the art among diffusion-based LMs, achieving a superior trade-off among generation quality, inference efficiency, and controllability. The code, pretrained weights, and technical documentation are publicly released.

📝 Abstract
Diffusion language models offer unique benefits over autoregressive models due to their potential for parallelized generation and controllability, yet they lag in likelihood modeling and are limited to fixed-length generation. In this work, we introduce a class of block diffusion language models that interpolate between discrete denoising diffusion and autoregressive models. Block diffusion overcomes key limitations of both approaches by supporting flexible-length generation and improving inference efficiency with KV caching and parallel token sampling. We propose a recipe for building effective block diffusion models that includes an efficient training algorithm, estimators of gradient variance, and data-driven noise schedules to minimize the variance. Block diffusion sets a new state-of-the-art performance among diffusion models on language modeling benchmarks and enables generation of arbitrary-length sequences. We provide the code, along with the model weights and blog post on the project page: https://m-arriola.com/bd3lms/
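The abstract's core mechanism (autoregressive generation across blocks, discrete denoising diffusion within each block, with previously generated blocks serving as cached context) can be sketched in a few lines. Everything below is a hypothetical toy: `denoise_step`, the prefix-reveal schedule, and the token names are illustrative stand-ins, not the paper's implementation.

```python
MASK = "<mask>"

def denoise_step(block, context, step, steps):
    # Toy stand-in for the learned denoiser: reveal a growing prefix of
    # masked positions at each step. A real block-diffusion denoiser is a
    # transformer that predicts all masked tokens in the block in parallel,
    # conditioned on the clean previous blocks (served from a KV cache).
    k = (step + 1) * len(block) // steps  # positions revealed after this step
    return [f"tok{len(context) + i}" if i < k and tok == MASK else tok
            for i, tok in enumerate(block)]

def sample_block_diffusion(num_blocks, block_size, steps=4):
    """Autoregressive over blocks, parallel denoising within each block."""
    sequence = []  # clean tokens so far; a real model would KV-cache this context
    for _ in range(num_blocks):
        block = [MASK] * block_size  # each new block starts fully masked
        for step in range(steps):    # iterative parallel denoising
            block = denoise_step(block, sequence, step, steps)
        sequence.extend(block)       # commit the finished block and move on
    return sequence
```

Setting `block_size = 1` recovers token-by-token autoregressive decoding, while a single block spanning the whole sequence recovers a standard fixed-length diffusion sampler; the block size is the interpolation knob.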
Problem

Research questions and friction points this paper is trying to address.

Diffusion LMs lag behind autoregressive models in likelihood modeling
Diffusion LMs are limited to fixed-length sequence generation
Diffusion inference cannot exploit KV caching, limiting efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interpolates between diffusion and autoregressive models
Supports flexible-length generation with KV caching
Uses data-driven noise schedules for variance minimization
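The noise-schedule bullet can be illustrated numerically: masked-diffusion training samples a masking rate t per example, and the 1/t weighting in the loss blows up when t is tiny, inflating the variance of the training signal; restricting the sampled range of t reduces that variance. The snippet below is a toy Monte-Carlo sketch under that assumption: `loss_estimate` and the interval [0.3, 0.8] are illustrative stand-ins, not the paper's actual estimator or learned schedule.

```python
import random
import statistics

def loss_estimate(t, rng):
    # Toy per-sample loss: masked-diffusion objectives weight the
    # reconstruction term by 1/t, so minibatch noise is amplified
    # whenever the sampled masking rate t is small.
    noise = rng.gauss(0.0, 0.1)  # stand-in for minibatch noise
    return (t + noise) / t

def schedule_variance(low, high, n=20000, seed=0):
    """Monte-Carlo variance of the loss estimator when masking rates
    are drawn uniformly from the clipped interval [low, high]."""
    rng = random.Random(seed)
    samples = [loss_estimate(rng.uniform(low, high), rng) for _ in range(n)]
    return statistics.pvariance(samples)

# Data-driven idea: choose a clipped range that lowers estimator variance
# instead of sampling t from (almost) the full interval (0, 1].
var_full = schedule_variance(1e-3, 1.0)
var_clipped = schedule_variance(0.3, 0.8)  # hypothetical clipped schedule
```

In this toy, `var_clipped` comes out far smaller than `var_full`, which is the qualitative effect the paper's variance estimators are used to optimize for when fitting the noise schedule to data.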
👥 Authors
Marianne Arriola, Cornell Tech, NY, USA
Aaron Gokaslan, Cornell University (computer vision, graphics, deep learning, robotics)
Justin T Chiu, Cohere (natural language processing)
Zhihan Yang, Cornell Tech, NY, USA
Zhixuan Qi, Cornell Tech, NY, USA
Jiaqi Han, Stanford University, CA, USA
S. Sahoo, Cornell Tech, NY, USA
Volodymyr Kuleshov, Cornell Tech (machine learning, artificial intelligence, computational biology)