Streaming-dLLM: Accelerating Diffusion LLMs via Suffix Pruning and Dynamic Decoding

📅 2026-01-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the inefficiencies in diffusion-based large language models (dLLMs) during inference, which arise from uniform modeling of information-sparse suffix regions and the use of fixed denoising schedules, leading to spatial redundancy and suboptimal time efficiency. To overcome these limitations, the authors propose a training-free, efficient inference framework that integrates two key innovations: decay-guided suffix pruning in the spatial dimension to skip low-information regions, and a dynamic confidence-aware early stopping mechanism in the temporal dimension to terminate redundant iterations for converged tokens. Notably, this is the first approach to jointly incorporate suffix pruning and dynamic early stopping without relying on KV caching. Experimental results demonstrate that the method achieves up to a 68.2× speedup while preserving generation quality.
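The decay-guided suffix pruning described above can be illustrated with a minimal sketch. The function below is a hypothetical illustration, not the paper's released algorithm: it assumes the far end of the masked suffix carries geometrically decaying information, sums those decaying weights into a token budget, and keeps only that many mask tokens in view. The function name, the `decay` parameter, and the geometric-budget rule are all assumptions made for illustration.

```python
import math

def prune_suffix(seq_len, generated, decay=0.5, min_window=4):
    """Hypothetical decay-guided suffix pruning (illustrative only):
    instead of modeling every remaining mask token, keep a window sized
    by a geometric budget that discounts far-away, information-sparse
    suffix positions."""
    remaining = seq_len - generated            # mask tokens left to denoise
    # budget = decay^0 + decay^1 + ... + decay^(remaining-1), a geometric sum
    budget = (1 - decay ** remaining) / (1 - decay)
    window = max(min_window, math.ceil(budget))
    return min(window, remaining)              # never exceed the true suffix
```

With the default `decay=0.5` the budget saturates near 2, so the floor `min_window` dominates for long suffixes; a decay closer to 1 keeps more of the suffix in view, trading speed for context.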

📝 Abstract
Diffusion Large Language Models (dLLMs) offer a compelling paradigm for natural language generation, leveraging parallel decoding and bidirectional attention to achieve superior global coherence compared to autoregressive models. While recent works have accelerated inference via KV cache reuse or heuristic decoding, they overlook the intrinsic inefficiencies within the block-wise diffusion process. Specifically, they suffer from spatial redundancy by modeling information-sparse suffix regions uniformly, and from temporal inefficiency by applying fixed denoising schedules throughout the decoding process. To address this, we propose Streaming-dLLM, a training-free framework that streamlines inference across both spatial and temporal dimensions. Spatially, we introduce attenuation-guided suffix modeling to approximate the full context by pruning redundant mask tokens. Temporally, we employ a dynamic confidence-aware strategy with an early-exit mechanism, allowing the model to skip unnecessary iterations for converged tokens. Extensive experiments show that Streaming-dLLM achieves up to a 68.2× speedup while maintaining generation quality, highlighting its effectiveness in diffusion decoding. The code is available at https://github.com/xiaoshideta/Streaming-dLLM.
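The temporal side of the abstract, the confidence-aware early exit, can be sketched in the same hedged spirit. The loop below is a simplified stand-in for the paper's mechanism: at each denoising step it commits any masked position whose top predicted probability clears a threshold `tau`, and exits as soon as no masks remain instead of running the full fixed schedule. The function name, `mask_id` convention, and per-position probability-dictionary interface are assumptions for illustration.

```python
def denoise_with_early_exit(logits_fn, tokens, mask_id=-1, max_steps=16, tau=0.9):
    """Hypothetical confidence-aware early exit (a sketch, not the paper's
    exact rule): each step, freeze masked positions whose top probability
    exceeds tau; stop once every position has converged, skipping the
    remaining fixed-schedule iterations."""
    for step in range(1, max_steps + 1):
        probs = logits_fn(tokens)                 # per-position {token: prob}
        for i, tok in enumerate(tokens):
            if tok == mask_id:
                best_tok, best_p = max(probs[i].items(), key=lambda kv: kv[1])
                if best_p >= tau:
                    tokens[i] = best_tok          # converged: commit token
        if mask_id not in tokens:                 # early exit: all converged
            return tokens, step
    return tokens, max_steps
```

If the model is confident everywhere on the first pass, the loop returns after one step; the gap between the steps used and `max_steps` is exactly the redundant iteration the temporal strategy eliminates.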
Problem

Research questions and friction points this paper is trying to address.

Diffusion LLMs
spatial redundancy
temporal inefficiency
suffix modeling
denoising schedule
Innovation

Methods, ideas, or system contributions that make the work stand out.

suffix pruning
dynamic decoding
diffusion LLMs
training-free acceleration
early exit