Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction

📅 2025-08-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Diffusion large language models (dLLMs) suffer from quadratic computational and memory complexity, hindering efficient long-context inference; existing caching strategies rely on storing full-layer hidden states, incurring prohibitive memory overhead. To address this, we propose the first training-free, dynamic sparse caching framework. We empirically discover a stability property in cross-layer attention patterns of dLLMs, enabling us to design a latency-aware bidirectional sparse caching mechanism grounded in token importance persistence. This mechanism integrates attention-guided dynamic eviction with selective prefix/suffix cleanup. Evaluated on LLaDA and Dream-series models, our method achieves up to 10× throughput improvement while preserving original model accuracy and peak memory usage—significantly outperforming state-of-the-art caching approaches.

📝 Abstract
Diffusion Large Language Models (dLLMs) enable breakthroughs in reasoning and parallel decoding but suffer from prohibitive quadratic computational complexity and memory overhead during inference. Current caching techniques accelerate decoding by storing full-layer states, yet impose substantial memory usage that limits long-context applications. Our analysis of attention patterns in dLLMs reveals persistent cross-layer sparsity, with pivotal tokens remaining salient across decoding steps and low-relevance tokens staying unimportant, motivating selective cache eviction. We propose Sparse-dLLM, the first training-free framework integrating dynamic cache eviction with sparse attention via delayed bidirectional sparse caching. By leveraging the stability of token saliency over steps, it retains critical tokens and dynamically evicts unimportant prefix/suffix entries using an attention-guided strategy. Extensive experiments on the LLaDA and Dream series demonstrate that Sparse-dLLM achieves up to 10× higher throughput than vanilla dLLMs, with comparable performance and similar peak memory costs, outperforming previous methods in efficiency and effectiveness.
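The attention-guided eviction the abstract describes can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the function name `evict_cache`, the `keep_ratio` parameter, and the score/cache shapes are assumptions. The core idea, ranking cached entries by the attention mass they receive and retaining only the most salient ones, is taken from the abstract.

```python
# Hedged sketch of attention-guided cache eviction (names are illustrative,
# not from the paper's code): given per-token attention scores averaged over
# heads and queries, keep the top-k most-attended cache entries and evict
# the rest, relying on the observation that token saliency is stable across
# decoding steps.
import numpy as np

def evict_cache(attn_scores: np.ndarray, cache: np.ndarray, keep_ratio: float = 0.5):
    """Keep the most-attended cache entries, preserving original token order.

    attn_scores: (seq_len,) average attention mass each cached token receives.
    cache:       (seq_len, d) cached states (e.g. KV entries).
    """
    k = max(1, int(len(attn_scores) * keep_ratio))
    keep = np.sort(np.argsort(attn_scores)[-k:])  # top-k indices, kept in sequence order
    return cache[keep], keep

# Toy example: 6 cached tokens; tokens 0, 2, 4 receive most of the attention.
scores = np.array([0.30, 0.02, 0.25, 0.01, 0.22, 0.20])
cache = np.arange(12, dtype=float).reshape(6, 2)
pruned, kept = evict_cache(scores, cache, keep_ratio=0.5)
print(kept.tolist())  # indices of retained tokens
```

In a full pipeline this pruning would be applied separately to prefix and suffix cache segments at each (delayed) refresh step, so the retained set can adapt as decoding proceeds.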
Problem

Research questions and friction points this paper is trying to address.

Reduce quadratic computational complexity in dLLMs
Minimize memory overhead during inference
Improve long-context application efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic cache eviction for efficiency
Delayed bidirectional sparse caching
Attention-guided token retention strategy
Yuerong Song
School of Computer Science, Fudan University, Shanghai Innovation Institute
Xiaoran Liu
Fudan University
natural language processing
Ruixiao Li
School of Computer Science, Fudan University, Shanghai Innovation Institute
Zhigeng Liu
School of Computer Science, Fudan University, Shanghai Innovation Institute
Zengfeng Huang
Fudan University
Algorithms, Graphs, Streaming, Learning, Theory
Qipeng Guo
Fudan University
Ziwei He
Shanghai Jiao Tong University
Machine Learning
Xipeng Qiu
School of Computer Science, Fudan University, Shanghai Innovation Institute