EntropyCache: Decoded Token Entropy Guided KV Caching for Diffusion Language Models

📅 2026-03-19
🤖 AI Summary
Diffusion language models suffer from low inference efficiency because their bidirectional attention is incompatible with KV cache reuse, forcing a full forward pass at every denoising step. This work proposes a lightweight, training-free KV caching strategy that dynamically decides whether to recompute the cached states of the k most recently decoded tokens, using the maximum entropy of the decoded token distributions as a proxy for cache freshness. The decision overhead is constant, accounting for only 0.5% of total inference time, and is independent of context length and model size. An empirical analysis further shows that decoded-token features keep fluctuating for multiple steps after unmasking, which motivates recomputing the k most recent tokens rather than only the latest one. Evaluated on LLaDA-8B-Instruct and Dream-7B-Instruct, the method achieves 15.2-26.4× speedups on standard tasks and 22.4-24.1× on chain-of-thought tasks while maintaining competitive accuracy.

📝 Abstract
Diffusion-based large language models (dLLMs) rely on bidirectional attention, which prevents lossless KV caching and requires a full forward pass at every denoising step. Existing approximate KV caching methods reduce this cost by selectively updating cached states, but their decision overhead scales with context length or model depth. We propose EntropyCache, a training-free KV caching method that uses the maximum entropy of newly decoded token distributions as a constant-cost signal for deciding when to recompute. Our design is grounded in two empirical observations: (1) decoded token entropy correlates with KV cache drift, providing a cheap proxy for cache staleness, and (2) feature volatility of decoded tokens persists for multiple steps after unmasking, motivating recomputation of the $k$ most recently decoded tokens. The skip-or-recompute decision requires only $O(V)$ computation per step, independent of context length and model scale. Experiments on LLaDA-8B-Instruct and Dream-7B-Instruct show that EntropyCache achieves $15.2\times$-$26.4\times$ speedup on standard benchmarks and $22.4\times$-$24.1\times$ on chain-of-thought benchmarks, with competitive accuracy and decision overhead accounting for only $0.5\%$ of inference time. Code is available at https://github.com/mscheong01/EntropyCache.
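The skip-or-recompute rule described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the threshold `tau` and the function names are hypothetical, and only the O(V)-per-token entropy computation and the thresholded decision reflect what the abstract states.

```python
import numpy as np

def max_decoded_entropy(probs: np.ndarray) -> float:
    """Shannon entropy of each newly decoded token's distribution; return the max.

    probs: (num_decoded, vocab_size) array of softmax probabilities.
    Cost is O(V) per decoded token, independent of context length and model depth.
    """
    eps = 1e-12  # avoid log(0)
    ent = -(probs * np.log(probs + eps)).sum(axis=-1)
    return float(ent.max())

def should_recompute(probs: np.ndarray, tau: float) -> bool:
    """Decide whether to refresh the cached KV states of the k most recently
    decoded tokens. High entropy is used as a proxy for cache staleness;
    `tau` is a hypothetical tuning threshold (the paper's exact rule may differ)."""
    return max_decoded_entropy(probs) > tau
```

A near-uniform distribution (high uncertainty) would trigger recomputation, while a confident, near-one-hot distribution would let the stale cache be reused for another step.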
Problem

Research questions and friction points this paper is trying to address.

diffusion language models, KV caching, bidirectional attention, cache staleness, inference efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

KV caching, diffusion language models, token entropy, training-free acceleration, bidirectional attention
Minsoo Cheong, Seoul National University
Donghyun Son, Seoul National University
Woosang Lim, Seoul National University
Sungjoo Yoo, Seoul National University