🤖 AI Summary
Diffusion-based large language models (dLLMs) lack efficient, standardized inference frameworks, hindering their practical deployment. Method: We propose dInfer, the first modular and scalable inference framework tailored for dLLMs, which decouples inference into four core components: model execution, diffusion iteration control, decoding strategy, and KV-cache management. Each component is equipped with dedicated algorithms and system-level optimizations, including denoising-generation acceleration, cache reuse, and parallel scheduling. Results: On an 8×H800 GPU setup, dInfer achieves over 1,100 tokens/s on HumanEval (batch size = 1) and an average throughput exceeding 800 tokens/s across six benchmarks. It outperforms Fast-dLLM by 10× and surpasses optimized autoregressive models by 2–3× in speed, significantly advancing the practical applicability of dLLMs.
📝 Abstract
Diffusion-based large language models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs, leveraging denoising-based generation to enable inherent parallelism. Although more and more open-source dLLMs have emerged, their widespread adoption remains constrained by the lack of a standardized and efficient inference framework. We present dInfer, an efficient and extensible framework for dLLM inference. dInfer decomposes the inference pipeline into four modular components (model, diffusion iteration manager, decoding strategy, and KV-cache manager) and integrates novel algorithms for each component alongside system-level optimizations. Through this combination of algorithmic innovations and system enhancements, dInfer achieves substantial efficiency gains without compromising output quality on LLaDA-MoE. At batch size 1, it surpasses 1,100 tokens per second on HumanEval and averages over 800 tokens per second across six benchmarks on 8× H800 GPUs. Compared to prior systems, dInfer delivers a 10× speedup over Fast-dLLM while maintaining similar model performance. Even against Qwen2.5-3B, an AR model with a comparable number of activated parameters and similar quality, highly optimized with the latest vLLM inference engine, dInfer still delivers a 2–3× speedup. The implementation of dInfer is open-sourced at https://github.com/inclusionAI/dInfer.
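The four-component decomposition described above can be sketched as minimal Python interfaces. This is an illustrative toy, assuming nothing about dInfer's actual API: every class, method, and the dummy token prediction are hypothetical, and real dLLM decoding would score masked positions with a neural network rather than a stub.

```python
class Model:
    """Model execution: scores masked positions.
    Here a stub that 'predicts' a fixed token id (42) for every mask."""
    def predict(self, tokens):
        return {i: 42 for i, t in enumerate(tokens) if t is None}


class KVCacheManager:
    """KV-cache management: tracks how much of the fully-decoded
    prefix can be reused across denoising iterations."""
    def __init__(self):
        self.cached = 0

    def update(self, tokens):
        # Extend the cache over the longest fully-decoded prefix.
        while self.cached < len(tokens) and tokens[self.cached] is not None:
            self.cached += 1


class DecodingStrategy:
    """Decoding strategy: decides which masked positions to commit
    each iteration (here: commit everything the model predicted)."""
    def select(self, predictions):
        return predictions


class IterationManager:
    """Diffusion iteration control: runs denoising iterations
    until no masked positions remain."""
    def __init__(self, model, strategy, cache):
        self.model, self.strategy, self.cache = model, strategy, cache

    def run(self, tokens, max_iters=16):
        for _ in range(max_iters):
            preds = self.model.predict(tokens)
            if not preds:          # nothing left to denoise
                break
            for pos, tok in self.strategy.select(preds).items():
                tokens[pos] = tok  # commit selected tokens in parallel
            self.cache.update(tokens)
        return tokens
```

Because each concern lives behind its own interface, one component (say, a more conservative `DecodingStrategy` that commits only high-confidence positions) can be swapped without touching the others, which is the modularity the abstract claims.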