🤖 AI Summary
Existing open-source diffusion large language models (dLLMs) suffer from significantly slower inference than autoregressive (AR) models of comparable scale. Method: We propose Discrete Diffusion Forcing (D2F), a novel strategy that reformulates dLLMs into an AR-diffusion hybrid paradigm. D2F enables block-level parallel decoding, KV cache reuse, and, crucially, for the first time, cross-block parallel prediction combined with block-level autoregressive generation. Integrated with asymmetric knowledge distillation and pipelined parallel decoding, D2F constructs an efficient inference architecture atop pre-trained dLLMs. Contribution/Results: On GSM8K, our method achieves 2.5× higher inference throughput than LLaMA3 and Qwen2.5, and over 50× speedup relative to prior dLLM baselines (e.g., LLaDA, Dream), while preserving competitive generation quality. This marks the first instance of an open-source dLLM surpassing AR models in both efficiency and capability in practical inference.
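To see why KV cache reuse matters, here is a deliberately crude token-recomputation count (not the paper's code; all function names are hypothetical, and attention cost is approximated by the number of tokens re-encoded per step). A vanilla dLLM re-encodes the full sequence at every denoising step, while block-wise AR generation caches the KV entries of finished blocks and only recomputes the active block:

```python
# Hypothetical cost model contrasting vanilla dLLM decoding with
# D2F-style block-wise decoding plus KV cache reuse. "Cost" here is
# simply how many tokens must be re-encoded per denoising step.

def vanilla_cost(num_blocks, block_len, steps_per_block):
    """Vanilla dLLM: every denoising step re-encodes the full sequence."""
    seq_len = num_blocks * block_len
    total_steps = num_blocks * steps_per_block
    return total_steps * seq_len

def blockwise_kv_cost(num_blocks, block_len, steps_per_block):
    """Block-wise AR decoding: finished blocks are served from the KV
    cache, so each step only re-encodes the active block's tokens."""
    total = 0
    for _ in range(num_blocks):
        for _ in range(steps_per_block):
            total += block_len
    return total

v = vanilla_cost(4, 32, 8)       # 32 steps * 128 tokens = 4096
c = blockwise_kv_cost(4, 32, 8)  # 32 steps * 32 tokens  = 1024
print(v, c, v // c)
```

Under these toy numbers the cached variant recomputes 4× fewer tokens; the real speedup additionally depends on cross-block parallelism and attention over the cached prefix, which this sketch ignores.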
📄 Abstract
Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs for text generation, with the potential to decode multiple tokens in a single iteration. However, no existing open-source dLLM has achieved superior inference speed over AR LLMs of similar size. This paper breaks this barrier with a simple and effective strategy named discrete diffusion forcing (D2F). D2F equips dLLMs with two key capabilities: (1) block-wise autoregressive generation to enable KV cache utilization; (2) prediction of following tokens without requiring completion of prior blocks, enabling inter-block parallel decoding. In this way, vanilla dLLMs are refurbished into an AR-diffusion hybrid paradigm for efficient inference. D2F can be implemented with an asymmetric distillation process based on pre-trained dLLMs. We further propose a pipelined parallel decoding algorithm, which enables a trade-off between efficiency and efficacy. Empirically, D2F dLLMs achieve more than $\mathbf{2.5\times}$ the inference speed of LLaMA3 and Qwen2.5 on GSM8K. Compared to vanilla dLLMs like LLaDA and Dream, the acceleration can be more than $\mathbf{50\times}$ while maintaining comparable output quality. The code is available at https://github.com/zhijie-group/Discrete-Diffusion-Forcing.
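The pipelined parallel decoding idea can be illustrated with a minimal scheduling sketch (this is not the paper's implementation; the names, the 50% entry threshold, and the fixed step counts are all hypothetical). The point is that a block may begin denoising once its predecessor is only partially complete, so several blocks refine concurrently and total wall-clock steps fall well below strict block-by-block decoding:

```python
# Illustrative sketch of pipelined block-level parallel decoding:
# a block joins the pipeline once its predecessor has made partial
# (not full) progress, so up to max_parallel_blocks blocks take a
# denoising step in the same iteration.

def pipelined_decode(num_blocks, steps_per_block, max_parallel_blocks=3):
    """Return per-iteration snapshots of denoising steps completed
    per block, until the last block is fully decoded."""
    progress = [0] * num_blocks              # steps done per block
    entry_threshold = steps_per_block // 2   # hypothetical: join at 50%
    history = []
    while progress[-1] < steps_per_block:
        active = []
        for i in range(num_blocks):
            if progress[i] >= steps_per_block:
                continue                     # block already finished
            if i > 0 and progress[i - 1] < entry_threshold:
                break                        # predecessor not far enough along
            active.append(i)
            if len(active) == max_parallel_blocks:
                break
        for i in active:                     # one parallel denoising step
            progress[i] += 1
        history.append(tuple(progress))
    return history

history = pipelined_decode(num_blocks=4, steps_per_block=4)
print(len(history), history[-1])
```

With 4 blocks of 4 denoising steps each, strict sequential decoding would need 16 iterations; the pipelined schedule above finishes in fewer because later blocks overlap with earlier ones. Raising the entry threshold trades speed for more conditioning context, mirroring the efficiency/efficacy trade-off described in the abstract.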