Fast-dLLM v2: Efficient Block-Diffusion LLM

📅 2025-09-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the inference inefficiency of autoregressive large language models (LLMs) caused by sequential decoding, this paper proposes Fast-dLLM v2, a block-diffusion framework for high-quality parallel text generation. Methodologically, it introduces (1) block-level diffusion modeling jointly trained with a complementary attention mask; (2) a hierarchical KV cache mechanism, operating at both block and sub-block levels, that substantially reduces redundant computation; and (3) efficient adaptation of pretrained AR LLMs using only about 1 billion tokens of fine-tuning data. Evaluated across diverse benchmarks, Fast-dLLM v2 matches or surpasses strong autoregressive baselines in generation quality while achieving up to 2.5× faster decoding, the highest inference efficiency among existing diffusion-based language models.

📝 Abstract
Autoregressive (AR) large language models (LLMs) have achieved remarkable performance across a wide range of natural language tasks, yet their inherent sequential decoding limits inference efficiency. In this work, we propose Fast-dLLM v2, a carefully designed block diffusion language model (dLLM) that efficiently adapts pretrained AR models into dLLMs for parallel text generation, requiring only approximately 1B tokens of fine-tuning. This represents a 500x reduction in training data compared to full-attention diffusion LLMs such as Dream (580B tokens), while preserving the original model's performance. Our approach introduces a novel training recipe that combines a block diffusion mechanism with a complementary attention mask, enabling blockwise bidirectional context modeling without sacrificing AR training objectives. To further accelerate decoding, we design a hierarchical caching mechanism: a block-level cache that stores historical context representations across blocks, and a sub-block cache that enables efficient parallel generation within partially decoded blocks. Coupled with our parallel decoding pipeline, Fast-dLLM v2 achieves up to 2.5x speedup over standard AR decoding without compromising generation quality. Extensive experiments across diverse benchmarks demonstrate that Fast-dLLM v2 matches or surpasses AR baselines in accuracy, while delivering state-of-the-art efficiency among dLLMs, marking a significant step toward the practical deployment of fast and accurate LLMs. Code and model will be publicly released.
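The "blockwise bidirectional context modeling" idea from the abstract can be made concrete with a toy attention mask: tokens attend causally across blocks (only to earlier blocks) but bidirectionally within their own block. This is a minimal sketch of that mask shape, not the paper's actual implementation; the function name and block size are illustrative.

```python
def block_attention_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """Toy blockwise mask: token i may attend to token j iff j lies in an
    earlier block (causal across blocks) or in the same block as i
    (bidirectional within a block)."""
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for j in range(seq_len):
            if j // block_size <= i // block_size:
                mask[i][j] = True
    return mask

m = block_attention_mask(8, 4)
# Token 2 (block 0) can attend to token 3 (same block, later position),
# which a standard causal mask would forbid; it still cannot see block 1.
```

Standard AR decoding corresponds to `block_size = 1`, which collapses this mask back to the usual causal triangle.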
Problem

Research questions and friction points this paper is trying to address.

Parallel text generation with reduced training data
Accelerating inference while preserving model performance
Efficient adaptation of autoregressive models via diffusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Block diffusion mechanism enables parallel text generation
Hierarchical caching accelerates decoding without quality loss
Minimal fine-tuning preserves AR model performance efficiently
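The hierarchical caching innovation above can be sketched as two tiers: a block-level cache that freezes KV entries for fully decoded blocks, and a sub-block cache holding the partially decoded current block so its entries are reused across parallel-decoding steps rather than recomputed. This is a hypothetical sketch under assumed semantics; the class and method names are illustrative, not the paper's API, and real KV entries would be tensors rather than strings.

```python
class HierarchicalKVCache:
    """Two-tier KV cache sketch: finalized blocks vs. the current partial block."""

    def __init__(self):
        self.block_cache = []      # list of finalized per-block KV lists
        self.sub_block_cache = []  # KV entries for the partially decoded block

    def add_token_kv(self, kv):
        # tokens accepted within the current block land in the sub-block cache
        self.sub_block_cache.append(kv)

    def commit_block(self):
        # once a block is fully decoded, promote its entries to the block level
        self.block_cache.append(list(self.sub_block_cache))
        self.sub_block_cache.clear()

    def context(self):
        # attention context = all finalized blocks + the current partial block
        flat = [kv for blk in self.block_cache for kv in blk]
        return flat + self.sub_block_cache

cache = HierarchicalKVCache()
cache.add_token_kv("k0")
cache.add_token_kv("k1")
cache.commit_block()       # block 0 finished
cache.add_token_kv("k2")   # block 1 in progress
```

The payoff is that parallel refinement steps inside a block only touch `sub_block_cache`, while everything in `block_cache` is read-only and computed once.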