🤖 AI Summary
Existing open-source diffusion large language models (dLLMs) suffer from significantly slower inference than autoregressive (AR) models of comparable scale. Method: We propose Discrete Diffusion Forcing (D2F), a novel strategy that reformulates dLLMs into an AR-diffusion hybrid paradigm. D2F enables block-level parallel decoding, KV cache reuse, and, crucially, for the first time, cross-block parallel prediction combined with block-level autoregressive generation. Integrated with asymmetric knowledge distillation and pipelined parallel decoding, D2F constructs an efficient inference architecture atop pre-trained dLLMs. Contribution/Results: On GSM8K, our method achieves 2.5× higher inference throughput than LLaMA3 and Qwen2.5, and over 50× speedup relative to prior dLLM baselines (e.g., LLaDA, Dream), while preserving competitive generation quality. This marks the first instance of an open-source dLLM surpassing AR models in both efficiency and capability in practical inference.
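To see why KV cache reuse matters, here is a deliberately crude token-recomputation count (not the paper's code; all function names are hypothetical, and attention cost is approximated by the number of tokens re-encoded per step). A vanilla dLLM re-encodes the full sequence at every denoising step, while block-wise AR generation caches the KV entries of finished blocks and only recomputes the active block:

```python
# Hypothetical cost model contrasting vanilla dLLM decoding with
# D2F-style block-wise decoding plus KV cache reuse. "Cost" here is
# simply how many tokens must be re-encoded per denoising step.

def vanilla_cost(num_blocks, block_len, steps_per_block):
    """Vanilla dLLM: every denoising step re-encodes the full sequence."""
    seq_len = num_blocks * block_len
    total_steps = num_blocks * steps_per_block
    return total_steps * seq_len

def blockwise_kv_cost(num_blocks, block_len, steps_per_block):
    """Block-wise AR decoding: finished blocks are served from the KV
    cache, so each step only re-encodes the active block's tokens."""
    total = 0
    for _ in range(num_blocks):
        for _ in range(steps_per_block):
            total += block_len
    return total

v = vanilla_cost(4, 32, 8)       # 32 steps * 128 tokens = 4096
c = blockwise_kv_cost(4, 32, 8)  # 32 steps * 32 tokens  = 1024
print(v, c, v // c)
```

Under these toy numbers the cached variant recomputes 4× fewer tokens; the real speedup additionally depends on cross-block parallelism and attention over the cached prefix, which this sketch ignores.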
📄 Abstract
Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs for text generation, with the potential to decode multiple tokens in a single iteration. However, no existing open-source dLLM has achieved superior inference speed over AR LLMs of similar size. This paper breaks this barrier with a simple and effective strategy named discrete diffusion forcing (D2F). D2F equips dLLMs with two key capabilities: (1) block-wise autoregressive generation to enable KV cache utilization; (2) prediction of following tokens without requiring completion of prior blocks, enabling inter-block parallel decoding. In this way, vanilla dLLMs are refurbished into an AR-diffusion hybrid paradigm for efficient inference. D2F can be implemented with an asymmetric distillation process based on pre-trained dLLMs. We further propose a pipelined parallel decoding algorithm, which enables a trade-off between efficiency and efficacy. Empirically, D2F dLLMs achieve more than $\mathbf{2.5\times}$ the inference speed of LLaMA3 and Qwen2.5 on GSM8K. Compared to vanilla dLLMs like LLaDA and Dream, the acceleration can be more than $\mathbf{50\times}$ while maintaining comparable output quality. The code is available at https://github.com/zhijie-group/Discrete-Diffusion-Forcing.
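The pipelined parallel decoding idea can be illustrated with a minimal scheduling sketch (this is not the paper's implementation; the names, the 50% entry threshold, and the fixed step counts are all hypothetical). The point is that a block may begin denoising once its predecessor is only partially complete, so several blocks refine concurrently and total wall-clock steps fall well below strict block-by-block decoding:

```python
# Illustrative sketch of pipelined block-level parallel decoding:
# a block joins the pipeline once its predecessor has made partial
# (not full) progress, so up to max_parallel_blocks blocks take a
# denoising step in the same iteration.

def pipelined_decode(num_blocks, steps_per_block, max_parallel_blocks=3):
    """Return per-iteration snapshots of denoising steps completed
    per block, until the last block is fully decoded."""
    progress = [0] * num_blocks              # steps done per block
    entry_threshold = steps_per_block // 2   # hypothetical: join at 50%
    history = []
    while progress[-1] < steps_per_block:
        active = []
        for i in range(num_blocks):
            if progress[i] >= steps_per_block:
                continue                     # block already finished
            if i > 0 and progress[i - 1] < entry_threshold:
                break                        # predecessor not far enough along
            active.append(i)
            if len(active) == max_parallel_blocks:
                break
        for i in active:                     # one parallel denoising step
            progress[i] += 1
        history.append(tuple(progress))
    return history

history = pipelined_decode(num_blocks=4, steps_per_block=4)
print(len(history), history[-1])
```

With 4 blocks of 4 denoising steps each, strict sequential decoding would need 16 iterations; the pipelined schedule above finishes in fewer because later blocks overlap with earlier ones. Raising the entry threshold trades speed for more conditioning context, mirroring the efficiency/efficacy trade-off described in the abstract.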