ES-dLLM: Efficient Inference for Diffusion Large Language Models by Early-Skipping

📅 2026-03-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion-based large language models (dLLMs) incur substantial computational overhead because the full context is reprocessed at every inference iteration. This work proposes a training-free acceleration framework that, for the first time, jointly leverages dynamic changes in intermediate-layer tensors (Key, Value, and hidden states) and token-level confidence to assess token importance on the fly, enabling unimportant tokens to be skipped dynamically in early layers. The method incorporates an efficient cache-scheduling mechanism that significantly improves inference efficiency while preserving generation quality. Evaluated on LLaDA-8B and Dream-7B, the approach achieves throughput of up to 226.57 and 308.51 tokens per second (TPS), respectively, yielding speedups of 5.6× to 16.8× over the baseline implementation and outperforming the current state-of-the-art caching method by up to 1.85×.
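The cache-scheduling idea can be illustrated with a minimal sketch: in early layers, only the tokens judged important are recomputed, while skipped tokens reuse their Key/Value entries from a prior iteration. The layer API, cache layout, and `project` callable below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def early_skip_layer(x, active_idx, kv_cache, layer_id, project):
    """Sketch of early-layer token skipping with a KV cache (hypothetical API).

    x: (seq_len, dim) hidden states for the current iteration.
    active_idx: indices of tokens deemed important enough to recompute.
    kv_cache: dict mapping layer_id -> (K, V) arrays of shape (seq_len, dim).
    project: callable mapping (n, dim) hidden states -> (K, V) pair.
    """
    K, V = kv_cache[layer_id]           # stale K/V from the last iteration
    k_new, v_new = project(x[active_idx])
    K, V = K.copy(), V.copy()
    K[active_idx] = k_new               # refresh only the important tokens;
    V[active_idx] = v_new               # skipped tokens keep cached values
    kv_cache[layer_id] = (K, V)
    return K, V
```

Because only `len(active_idx)` rows pass through the projection, the early-layer cost scales with the number of active tokens rather than the full sequence length.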

📝 Abstract
Diffusion large language models (dLLMs) are emerging as a promising alternative to autoregressive models (ARMs) due to their ability to capture bidirectional context and their potential for parallel generation. Despite these advantages, dLLM inference remains computationally expensive because the full input context is processed at every iteration. In this work, we analyze the generation dynamics of dLLMs and find that intermediate representations, including key, value, and hidden states, change only subtly across successive iterations. Leveraging this insight, we propose ES-dLLM, a training-free inference acceleration framework for dLLMs that reduces computation by skipping tokens in early layers based on estimated token importance. Token importance is computed from intermediate tensor variation and the confidence scores of previous iterations. Experiments on LLaDA-8B and Dream-7B demonstrate that ES-dLLM achieves throughput of up to 226.57 and 308.51 tokens per second (TPS), respectively, on an NVIDIA H200 GPU, delivering 5.6× to 16.8× speedup over the vanilla implementation and up to 1.85× over the state-of-the-art caching method, while preserving generation quality.
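The abstract's importance criterion combines two signals: how much a token's intermediate representation changed between successive iterations, and how confident the model already is about that token. A minimal sketch of one plausible scoring rule follows; the function names, the relative-norm formulation, and the `alpha` mixing weight are assumptions for illustration, not the paper's exact formula.

```python
import numpy as np

def token_importance(prev_hidden, curr_hidden, confidence, alpha=0.5):
    """Score tokens by representation drift across iterations, blended with
    (1 - confidence): undecided tokens stay important (hypothetical rule).

    prev_hidden, curr_hidden: (seq_len, dim) states from the same layer.
    confidence: (seq_len,) per-token confidence from the previous iteration.
    """
    # Relative change of each token's representation between iterations.
    delta = np.linalg.norm(curr_hidden - prev_hidden, axis=-1)
    delta = delta / (np.linalg.norm(prev_hidden, axis=-1) + 1e-8)
    return alpha * delta + (1.0 - alpha) * (1.0 - confidence)

def select_active_tokens(scores, keep_ratio=0.25):
    """Indices of the highest-importance tokens to process in early layers;
    the remainder can be skipped and served from cache."""
    k = max(1, int(len(scores) * keep_ratio))
    return np.argsort(scores)[-k:]
```

Under this rule, a token is kept active either because its representation is still moving or because its decoding is still uncertain; stable, high-confidence tokens are the ones skipped.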
Problem

Research questions and friction points this paper is trying to address.

diffusion large language models
inference efficiency
computational cost
token processing
generation dynamics
Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion large language models
inference acceleration
early token skipping
training-free optimization
intermediate representation stability