🤖 AI Summary
Diffusion-based large language models (dLLMs) lack efficient, standardized inference frameworks, hindering their practical deployment. Method: We propose dInfer, the first modular and scalable inference framework tailored for dLLMs, which decouples inference into four core components: model execution, diffusion iteration control, decoding strategy, and KV-cache management. Each component is equipped with dedicated algorithms and system-level optimizations, including denoising-generation acceleration, cache reuse, and parallel scheduling. Results: On an 8×H800 GPU setup, dInfer achieves over 1,100 tokens/s on HumanEval (batch size = 1) and an average throughput exceeding 800 tokens/s across six benchmarks. It outperforms Fast-dLLM by 10× and surpasses optimized autoregressive models by 2–3× in speed, significantly advancing the practical applicability of dLLMs.
📝 Abstract
Diffusion-based large language models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs, leveraging denoising-based generation to enable inherent parallelism. Although more and more open-source dLLMs have emerged, their widespread adoption remains constrained by the lack of a standardized and efficient inference framework. We present dInfer, an efficient and extensible framework for dLLM inference. dInfer decomposes the inference pipeline into four modular components (model, diffusion iteration manager, decoding strategy, and KV-cache manager) and integrates novel algorithms for each component alongside system-level optimizations. Through this combination of algorithmic innovations and system enhancements, dInfer achieves substantial efficiency gains without compromising output quality on LLaDA-MoE. At batch size 1, it surpasses 1,100 tokens per second on HumanEval and averages over 800 tokens per second across six benchmarks on 8× H800 GPUs. Compared to prior systems, dInfer delivers a 10× speedup over Fast-dLLM while maintaining similar model performance. Even against Qwen2.5-3B, an AR model with a comparable number of activated parameters and similar quality, highly optimized with the latest vLLM inference engine, dInfer still delivers a 2–3× speedup. The implementation of dInfer is open-sourced at https://github.com/inclusionAI/dInfer.
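The four-component decomposition described above can be sketched as minimal Python interfaces. This is an illustrative toy, assuming nothing about dInfer's actual API: every class, method, and the dummy token prediction are hypothetical, and real dLLM decoding would score masked positions with a neural network rather than a stub.

```python
class Model:
    """Model execution: scores masked positions.
    Here a stub that 'predicts' a fixed token id (42) for every mask."""
    def predict(self, tokens):
        return {i: 42 for i, t in enumerate(tokens) if t is None}


class KVCacheManager:
    """KV-cache management: tracks how much of the fully-decoded
    prefix can be reused across denoising iterations."""
    def __init__(self):
        self.cached = 0

    def update(self, tokens):
        # Extend the cache over the longest fully-decoded prefix.
        while self.cached < len(tokens) and tokens[self.cached] is not None:
            self.cached += 1


class DecodingStrategy:
    """Decoding strategy: decides which masked positions to commit
    each iteration (here: commit everything the model predicted)."""
    def select(self, predictions):
        return predictions


class IterationManager:
    """Diffusion iteration control: runs denoising iterations
    until no masked positions remain."""
    def __init__(self, model, strategy, cache):
        self.model, self.strategy, self.cache = model, strategy, cache

    def run(self, tokens, max_iters=16):
        for _ in range(max_iters):
            preds = self.model.predict(tokens)
            if not preds:          # nothing left to denoise
                break
            for pos, tok in self.strategy.select(preds).items():
                tokens[pos] = tok  # commit selected tokens in parallel
            self.cache.update(tokens)
        return tokens
```

Because each concern lives behind its own interface, one component (say, a more conservative `DecodingStrategy` that commits only high-confidence positions) can be swapped without touching the others, which is the modularity the abstract claims.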