🤖 AI Summary
This work addresses the substantial computational redundancy in diffusion language model inference, where full-sequence attention is recomputed at every step, even over already decoded or still-masked regions. The study is the first to reveal structural locality and temporal stability in the decoding process, and it introduces a training-free sliding-window mechanism that dynamically partitions tokens into active, buffer, and far-field regions. Attention is computed only within a localized window, complemented by token-level pruning, KV-cache reuse, and a phased refresh strategy. The method applies directly to pretrained models and achieves up to a 99× inference speedup on LLaDA and Dream while largely preserving generation quality under the same computational budget.
📄 Abstract
Diffusion language models (DLMs) generate text through iterative denoising, but inference requires full-sequence attention at every iteration, resulting in substantial redundant computation on masked tokens. Block-wise diffusion can reduce this cost, yet it typically relies on retraining and constrained update orders, limiting its direct applicability to pretrained DLMs. Our token-level analysis reveals pronounced structural locality in DLM inference. Decoding is driven by a small set of prefix-localized active tokens; the influence of distant undecoded context diminishes rapidly, and decoded tokens exhibit stage-wise temporal stability, enabling reuse of intermediate representations except for a brief post-decode transient. Motivated by these observations, we propose \placeholder (source code: https://github.com/vhicrgit/Window-Diffusion), a window-based token pruning and caching method for inference. We maintain a local computation window that slides rightward as denoising progresses, and partition undecoded tokens into: (i) active tokens that are computed online, (ii) buffer tokens whose KV states are cached and periodically refreshed, and (iii) far-field tokens that are pruned outside the window. Computation is restricted to active and buffer tokens within the window, while far-field tokens are omitted at each stage. Experiments on LLaDA and Dream show that, under matched compute budgets, our method achieves up to 99× inference speedup while largely preserving generation performance.
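To make the three-way partition concrete, here is a minimal sketch of how undecoded positions might be split relative to the sliding window. The function name and the `window` / `n_active` parameters are illustrative assumptions, not values from the paper; the actual method also handles KV caching and phased refresh, which are omitted here.

```python
# Hypothetical sketch of the window-based token partition described above.
# `frontier` is the index of the first undecoded token (the window's left edge).
def partition_tokens(seq_len, frontier, window=64, n_active=16):
    """Return (active, buffer, far_field) position lists for undecoded tokens."""
    undecoded = list(range(frontier, seq_len))
    active = undecoded[:n_active]        # computed online at every denoising step
    buffer = undecoded[n_active:window]  # KV states cached, periodically refreshed
    far_field = undecoded[window:]       # pruned from attention entirely
    return active, buffer, far_field

# Example: a 256-token sequence with the first 40 tokens already decoded.
active, buf, far = partition_tokens(seq_len=256, frontier=40)
```

As the frontier advances during denoising, the window slides rightward, so far-field tokens gradually become buffer tokens and then active tokens.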