dInfer: An Efficient Inference Framework for Diffusion Language Models

📅 2025-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion-based large language models (dLLMs) lack efficient, standardized inference frameworks, hindering their practical deployment. Method: We propose dInfer, the first modular and scalable inference framework tailored to dLLMs. It decouples inference into four core components: model execution, diffusion iteration control, decoding strategy, and KV-cache management. Each component is equipped with dedicated algorithms and system-level optimizations, including denoising-generation acceleration, cache reuse, and parallel scheduling. Results: On an 8×H800 GPU setup, dInfer achieves over 1,100 tokens/s on HumanEval (batch size 1) and an average throughput exceeding 800 tokens/s across six benchmarks. It delivers a 10× speedup over Fast-dLLM and surpasses optimized autoregressive models by 2-3× in speed, significantly advancing the practical applicability of dLLMs.

📝 Abstract
Diffusion-based large language models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs, leveraging denoising-based generation to enable inherent parallelism. Although more and more open-source dLLMs are emerging, their widespread adoption remains constrained by the lack of a standardized and efficient inference framework. We present dInfer, an efficient and extensible framework for dLLM inference. dInfer decomposes the inference pipeline into four modular components (model, diffusion iteration manager, decoding strategy, and KV-cache manager) and integrates novel algorithms for each component alongside system-level optimizations. Through this combination of algorithmic innovations and system enhancements, dInfer achieves substantial efficiency gains without compromising output quality on LLaDA-MoE. At batch size 1, it surpasses 1,100 tokens per second on HumanEval and averages over 800 tokens per second across six benchmarks on 8× H800 GPUs. Compared to prior systems, dInfer delivers a 10× speedup over Fast-dLLM while maintaining similar model performance. Even compared with Qwen2.5-3B, an AR model with a comparable number of activated parameters and comparable performance that is highly optimized with the latest vLLM inference engine, dInfer still delivers a 2-3× speedup. The implementation of dInfer is open-sourced at https://github.com/inclusionAI/dInfer.
Problem

Research questions and friction points this paper is trying to address.

Developing an efficient inference framework for diffusion language models
Addressing the lack of a standardized inference system for dLLMs
Optimizing inference speed while maintaining output quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modular pipeline with four optimized components
Novel algorithms combined with system-level optimizations
Achieves a 10× speedup over Fast-dLLM, a prior dLLM inference system
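The four-component decomposition described above can be sketched as a toy denoising-and-commit loop. This is a minimal illustration of the architecture only; every class and function name here is an assumption for illustration, not the actual dInfer API, and the scoring logic is a stand-in for a real model.

```python
from typing import List, Tuple

MASK = -1  # sentinel for not-yet-decoded positions (illustrative)

class ToyModel:
    """Model execution: score every masked position in one pass.
    A real dLLM returns logits; here we fabricate (confidence, token)."""
    def denoise(self, tokens: List[int]) -> List[Tuple[float, int]]:
        return [(1.0, i % 7) if t == MASK else (0.0, t)
                for i, t in enumerate(tokens)]

class ParallelDecoder:
    """Decoding strategy: commit every masked position whose confidence
    clears a threshold -- the source of dLLMs' inherent parallelism."""
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold

    def step(self, tokens, scores):
        return [p if t == MASK and c >= self.threshold else t
                for t, (c, p) in zip(tokens, scores)]

class IterationManager:
    """Diffusion iteration control: stop when nothing is masked
    or a step budget is exhausted."""
    def __init__(self, max_steps: int = 16):
        self.max_steps = max_steps

    def done(self, step: int, tokens: List[int]) -> bool:
        return step >= self.max_steps or MASK not in tokens

class NoOpKVCache:
    """KV-cache management: the real framework reuses and refreshes
    caches across denoising steps; this toy version stores nothing."""
    def update(self, tokens: List[int]) -> None:
        pass

def generate(prompt: List[int], length: int) -> List[int]:
    model, decoder = ToyModel(), ParallelDecoder()
    ctrl, cache = IterationManager(), NoOpKVCache()
    tokens = prompt + [MASK] * length
    step = 0
    while not ctrl.done(step, tokens):
        tokens = decoder.step(tokens, model.denoise(tokens))
        cache.update(tokens)
        step += 1
    return tokens
```

Because each component sits behind its own small interface, a decoding strategy or cache policy can be swapped without touching the iteration loop, which is the extensibility argument the abstract makes for the modular design.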