Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction

📅 2025-08-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Diffusion large language models (dLLMs) suffer from quadratic computational and memory complexity, hindering efficient long-context inference; existing caching strategies rely on storing full-layer hidden states, incurring prohibitive memory overhead. To address this, we propose the first training-free, dynamic sparse caching framework. We empirically discover a stability property in cross-layer attention patterns of dLLMs, enabling us to design a latency-aware bidirectional sparse caching mechanism grounded in token importance persistence. This mechanism integrates attention-guided dynamic eviction with selective prefix/suffix cleanup. Evaluated on LLaDA and Dream-series models, our method achieves up to 10× throughput improvement while preserving original model accuracy and peak memory usage—significantly outperforming state-of-the-art caching approaches.

📝 Abstract
Diffusion Large Language Models (dLLMs) enable breakthroughs in reasoning and parallel decoding but suffer from prohibitive quadratic computational complexity and memory overhead during inference. Current caching techniques accelerate decoding by storing full-layer states, yet impose substantial memory usage that limits long-context applications. Our analysis of attention patterns in dLLMs reveals persistent cross-layer sparsity, with pivotal tokens remaining salient across decoding steps and low-relevance tokens staying unimportant, motivating selective cache eviction. We propose Sparse-dLLM, the first training-free framework integrating dynamic cache eviction with sparse attention via delayed bidirectional sparse caching. By leveraging the stability of token saliency over steps, it retains critical tokens and dynamically evicts unimportant prefix/suffix entries using an attention-guided strategy. Extensive experiments on the LLaDA and Dream series demonstrate that Sparse-dLLM achieves up to 10× higher throughput than vanilla dLLMs, with comparable performance and similar peak memory costs, outperforming previous methods in efficiency and effectiveness.
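The attention-guided eviction the abstract describes can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the function name `evict_cache`, the `keep_ratio` parameter, and the score/cache shapes are assumptions. The core idea, ranking cached entries by the attention mass they receive and retaining only the most salient ones, is taken from the abstract.

```python
# Hedged sketch of attention-guided cache eviction (names are illustrative,
# not from the paper's code): given per-token attention scores averaged over
# heads and queries, keep the top-k most-attended cache entries and evict
# the rest, relying on the observation that token saliency is stable across
# decoding steps.
import numpy as np

def evict_cache(attn_scores: np.ndarray, cache: np.ndarray, keep_ratio: float = 0.5):
    """Keep the most-attended cache entries, preserving original token order.

    attn_scores: (seq_len,) average attention mass each cached token receives.
    cache:       (seq_len, d) cached states (e.g. KV entries).
    """
    k = max(1, int(len(attn_scores) * keep_ratio))
    keep = np.sort(np.argsort(attn_scores)[-k:])  # top-k indices, kept in sequence order
    return cache[keep], keep

# Toy example: 6 cached tokens; tokens 0, 2, 4 receive most of the attention.
scores = np.array([0.30, 0.02, 0.25, 0.01, 0.22, 0.20])
cache = np.arange(12, dtype=float).reshape(6, 2)
pruned, kept = evict_cache(scores, cache, keep_ratio=0.5)
print(kept.tolist())  # indices of retained tokens
```

In a full pipeline this pruning would be applied separately to prefix and suffix cache segments at each (delayed) refresh step, so the retained set can adapt as decoding proceeds.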
Problem

Research questions and friction points this paper is trying to address.

Reduce quadratic computational complexity in dLLMs
Minimize memory overhead during inference
Improve long-context application efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic cache eviction for efficiency
Delayed bidirectional sparse caching
Attention-guided token retention strategy
Yuerong Song
School of Computer Science, Fudan University, Shanghai Innovation Institute
Xiaoran Liu
Fudan University
natural language processing
Ruixiao Li
School of Computer Science, Fudan University, Shanghai Innovation Institute
Zhigeng Liu
School of Computer Science, Fudan University, Shanghai Innovation Institute
Zengfeng Huang
Fudan University
Algorithms, Graphs, Streaming, Learning, Theory
Qipeng Guo
Fudan University
Ziwei He
Shanghai Jiao Tong University
Machine Learning
Xipeng Qiu
School of Computer Science, Fudan University, Shanghai Innovation Institute