Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models

📅 2025-03-12
🤖 AI Summary
Diffusion language models (DLMs) offer parallel decoding and controllability, but they lag behind autoregressive models in likelihood modeling and are restricted to fixed-length generation. To address these limitations, the paper proposes block diffusion language models: a discrete diffusion framework that incorporates autoregressive modeling across blocks, enabling variable-length text generation and efficient parallel sampling. The key contributions are: (1) a block-wise denoising paradigm that interpolates between autoregressive and diffusion models and supports arbitrary-length sequences; (2) data-driven noise schedules, paired with gradient-variance estimators, that substantially improve training stability; and (3) KV caching and parallel token sampling, which accelerate inference. On standard language modeling benchmarks, block diffusion sets a new state of the art among diffusion-based LMs, achieving a superior trade-off among generation quality, inference efficiency, and controllability. The code, pretrained weights, and technical documentation are publicly released.

📝 Abstract
Diffusion language models offer unique benefits over autoregressive models due to their potential for parallelized generation and controllability, yet they lag in likelihood modeling and are limited to fixed-length generation. In this work, we introduce a class of block diffusion language models that interpolate between discrete denoising diffusion and autoregressive models. Block diffusion overcomes key limitations of both approaches by supporting flexible-length generation and improving inference efficiency with KV caching and parallel token sampling. We propose a recipe for building effective block diffusion models that includes an efficient training algorithm, estimators of gradient variance, and data-driven noise schedules to minimize the variance. Block diffusion sets a new state-of-the-art performance among diffusion models on language modeling benchmarks and enables generation of arbitrary-length sequences. We provide the code, along with the model weights and blog post on the project page: https://m-arriola.com/bd3lms/
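The abstract's core mechanism (autoregressive generation across blocks, discrete denoising diffusion within each block, with previously generated blocks serving as cached context) can be sketched in a few lines. Everything below is a hypothetical toy: `denoise_step`, the prefix-reveal schedule, and the token names are illustrative stand-ins, not the paper's implementation.

```python
MASK = "<mask>"

def denoise_step(block, context, step, steps):
    # Toy stand-in for the learned denoiser: reveal a growing prefix of
    # masked positions at each step. A real block-diffusion denoiser is a
    # transformer that predicts all masked tokens in the block in parallel,
    # conditioned on the clean previous blocks (served from a KV cache).
    k = (step + 1) * len(block) // steps  # positions revealed after this step
    return [f"tok{len(context) + i}" if i < k and tok == MASK else tok
            for i, tok in enumerate(block)]

def sample_block_diffusion(num_blocks, block_size, steps=4):
    """Autoregressive over blocks, parallel denoising within each block."""
    sequence = []  # clean tokens so far; a real model would KV-cache this context
    for _ in range(num_blocks):
        block = [MASK] * block_size  # each new block starts fully masked
        for step in range(steps):    # iterative parallel denoising
            block = denoise_step(block, sequence, step, steps)
        sequence.extend(block)       # commit the finished block and move on
    return sequence
```

Setting `block_size = 1` recovers token-by-token autoregressive decoding, while a single block spanning the whole sequence recovers a standard fixed-length diffusion sampler; the block size is the interpolation knob.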
Problem

Research questions and friction points this paper is trying to address.

Diffusion LMs lag behind autoregressive models in likelihood modeling
Diffusion LMs are limited to fixed-length sequence generation
Diffusion inference cannot exploit KV caching, limiting efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interpolates between diffusion and autoregressive models
Supports flexible-length generation with KV caching
Uses data-driven noise schedules for variance minimization
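The noise-schedule bullet can be illustrated numerically: masked-diffusion training samples a masking rate t per example, and the 1/t weighting in the loss blows up when t is tiny, inflating the variance of the training signal; restricting the sampled range of t reduces that variance. The snippet below is a toy Monte-Carlo sketch under that assumption: `loss_estimate` and the interval [0.3, 0.8] are illustrative stand-ins, not the paper's actual estimator or learned schedule.

```python
import random
import statistics

def loss_estimate(t, rng):
    # Toy per-sample loss: masked-diffusion objectives weight the
    # reconstruction term by 1/t, so minibatch noise is amplified
    # whenever the sampled masking rate t is small.
    noise = rng.gauss(0.0, 0.1)  # stand-in for minibatch noise
    return (t + noise) / t

def schedule_variance(low, high, n=20000, seed=0):
    """Monte-Carlo variance of the loss estimator when masking rates
    are drawn uniformly from the clipped interval [low, high]."""
    rng = random.Random(seed)
    samples = [loss_estimate(rng.uniform(low, high), rng) for _ in range(n)]
    return statistics.pvariance(samples)

# Data-driven idea: choose a clipped range that lowers estimator variance
# instead of sampling t from (almost) the full interval (0, 1].
var_full = schedule_variance(1e-3, 1.0)
var_clipped = schedule_variance(0.3, 0.8)  # hypothetical clipped schedule
```

In this toy, `var_clipped` comes out far smaller than `var_full`, which is the qualitative effect the paper's variance estimators are used to optimize for when fitting the noise schedule to data.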
👥 Authors
Marianne Arriola, Cornell Tech, NY, USA
Aaron Gokaslan, Cornell University (computer vision, graphics, deep learning, robotics)
Justin T Chiu, Cohere (natural language processing)
Zhihan Yang, Cornell Tech, NY, USA
Zhixuan Qi, Cornell Tech, NY, USA
Jiaqi Han, Stanford University, CA, USA
S. Sahoo, Cornell Tech, NY, USA
Volodymyr Kuleshov, Cornell Tech (machine learning, artificial intelligence, computational biology)