Accelerating Diffusion Language Model Inference via Efficient KV Caching and Guided Diffusion

📅 2025-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion language models (DLMs) suffer from slow inference, high computational overhead for long contexts, and token incoherence under parallel generation, and existing acceleration techniques often trade away generation quality. This paper proposes a training-free inference framework built on two techniques. First, FreeCache, an approximate key-value (KV) caching mechanism that reuses stable KV projections across denoising steps without retraining. Second, Guided Diffusion, which uses a lightweight pretrained autoregressive (AR) model to supervise token unmasking, sharply cutting the number of denoising iterations. Together, the two techniques deliver up to a 34x end-to-end speedup on open-source reasoning benchmarks with no loss in generation quality. For the first time, DLM latency matches, and even beats, that of leading AR models such as Qwen2.5 and Llama3, a critical step toward practical DLM deployment.

📝 Abstract
Diffusion language models offer parallel token generation and inherent bidirectionality, promising more efficient and powerful sequence modeling than autoregressive approaches. However, state-of-the-art diffusion models (e.g., Dream 7B, LLaDA 8B) suffer from slow inference. While they match the quality of similarly sized autoregressive (AR) models (e.g., Qwen2.5 7B, Llama3 8B), their iterative denoising requires multiple full-sequence forward passes, resulting in high computational costs and latency, particularly for long input prompts and long-context scenarios. Furthermore, parallel token generation introduces token incoherence problems, and current sampling heuristics suffer significant quality drops as denoising steps decrease. We address these limitations with two training-free techniques. First, we propose FreeCache, a key-value (KV) approximation caching technique that reuses stable KV projections across denoising steps, effectively reducing the computational cost of DLM inference. Second, we introduce Guided Diffusion, a training-free method that uses a lightweight pretrained autoregressive model to supervise token unmasking, dramatically reducing the total number of denoising iterations without sacrificing quality. We conduct extensive evaluations on open-source reasoning benchmarks, and our combined methods deliver up to a 34x end-to-end speedup without compromising accuracy. For the first time, diffusion language models achieve latency comparable to, and even lower than, that of widely adopted autoregressive models. Our work paves the way for scaling diffusion language models to a broader range of applications across domains.
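
The FreeCache mechanism lends itself to a compact illustration: key/value projections of already-committed tokens change little between denoising steps, so they can be computed once and spliced together with fresh projections for the positions that are still masked. The toy PyTorch sketch below shows only that caching pattern; the single attention layer, the norm-based commit rule, and the denoising update are stand-ins of ours, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy single-head attention standing in for one DLM block. Dimensions,
# weights, and the update rule below are illustrative assumptions.
d = 16
Wq, Wk, Wv = (torch.randn(d, d) * 0.1 for _ in range(3))

def attend(x, k, v):
    # Standard scaled dot-product attention over the full sequence.
    q = x @ Wq
    return F.softmax(q @ k.T / d ** 0.5, dim=-1) @ v

seq_len, steps = 8, 4
x = torch.randn(seq_len, d)                   # hidden states
mask = torch.ones(seq_len, dtype=torch.bool)  # True = still masked
k_cache, v_cache = x @ Wk, x @ Wv             # projected once, then reused

for step in range(steps):
    # FreeCache idea: K/V projections of already-unmasked tokens are
    # approximately stable across steps, so refresh the projections only
    # at positions that are still masked and reuse the cached rest.
    k_cache[mask] = x[mask] @ Wk
    v_cache[mask] = x[mask] @ Wv
    out = attend(x, k_cache, v_cache)
    # Stand-in commit rule: unmask the masked position with the largest
    # output norm (a real DLM would use model confidence instead).
    scores = out.norm(dim=-1).masked_fill(~mask, float("-inf"))
    mask[scores.argmax()] = False
    x = x + 0.1 * out                         # stand-in denoising update
```

The saving comes from the masked indexing in the loop: as more tokens are committed, a growing fraction of the sequence skips its K/V projections each step, which is where long-prompt DLM inference spends much of its compute.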
Problem

Research questions and friction points this paper is trying to address.

Slow inference in diffusion language models due to iterative denoising
Token incoherence issues from parallel generation in diffusion models
Quality drops with reduced denoising steps in current sampling methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

FreeCache reuses stable KV projections across denoising steps (see the caching sketch above)
Guided Diffusion uses a lightweight AR model to cut denoising iterations (sketched below)
Combined, the two techniques deliver up to a 34x end-to-end speedup
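
The Guided Diffusion side can be sketched just as compactly: the diffusion model proposes tokens for all masked positions in parallel, and the lightweight AR model vets each proposal, so many positions can be committed per step without the incoherence of unguided parallel decoding. Below is a minimal Python rendering of that accept/reject pattern; the `guided_unmask` helper, the fixed confidence threshold, and the argmax proposals are our assumptions, not the paper's exact unmasking rule.

```python
import torch
import torch.nn.functional as F

def guided_unmask(dlm_logits, ar_model, tokens, mask, threshold=0.9):
    """Accept a parallel DLM proposal only where a small AR model agrees.

    dlm_logits: the diffusion model's logits for every position, shape [L, V].
    ar_model: any left-to-right LM mapping a token sequence to next-token
    logits of shape [L, V]. Both interfaces are assumptions for illustration.
    """
    proposals = dlm_logits.argmax(dim=-1)
    seq = torch.where(mask, proposals, tokens)
    ar_probs = F.softmax(ar_model(seq), dim=-1)   # row i predicts token i+1
    # AR confidence in the token at position i is the probability assigned
    # to it by the prediction made at position i-1. Position 0 has no left
    # context in this sketch, so it stays masked here.
    conf = torch.zeros(seq.shape[0])
    conf[1:] = ar_probs[:-1].gather(-1, seq[1:].unsqueeze(-1)).squeeze(-1)
    accept = mask & (conf > threshold)            # commit AR-endorsed tokens
    return torch.where(accept, proposals, tokens), mask & ~accept

# Toy usage with random tensors standing in for the real models.
V, L = 100, 6
ar_model = lambda seq: torch.randn(L, V)  # placeholder "lightweight AR model"
tokens = torch.zeros(L, dtype=torch.long)
mask = torch.ones(L, dtype=torch.bool)
tokens, mask = guided_unmask(torch.randn(L, V), ar_model, tokens, mask, 0.5)
```

Because every accepted position is one the AR model itself would have generated, the loop can commit several tokens per step at AR-level coherence, which is how the iteration count drops without retraining either model.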