SparseD: Sparse Attention for Diffusion Language Models

📅 2025-09-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion language models (DLMs) suffer from high inference latency for long-context generation due to the quadratic complexity of self-attention. We observe two key properties in DLM attention: (1) head-specific sparse patterns that remain highly stable across denoising steps, and (2) critical dependence of generation quality on early denoising steps. Leveraging these insights, we propose head-specific staged sparse attention: sparse patterns are precomputed once per head; full attention is retained in early, quality-critical steps, while later steps switch to sparse computation—accelerated via FlashAttention. Evaluated on 64K-context generation with 1,024 denoising steps, our method achieves up to 1.50× speedup over FlashAttention with no degradation in generation quality. Our core contribution is the first identification and exploitation of head-specificity and inter-step stability in DLM attention, enabling efficient, lossless long-context inference.

📝 Abstract
While diffusion language models (DLMs) offer a promising alternative to autoregressive models (ARs), existing open-source DLMs suffer from high inference latency. This bottleneck is mainly due to the attention's quadratic complexity with respect to context length in computing all query-key pairs. Intuitively, to reduce this complexity, a natural strategy is to restrict attention to sparse patterns that retain only the most relevant connections. Such approaches are well-established in ARs, where attention follows fixed and clearly defined sparse patterns. However, in DLMs, we observe distinct sparsity behaviors: (1) attention patterns vary across heads, (2) attention patterns in each head remain highly similar across denoising steps, and (3) early denoising steps are critical for generation. These findings render sparse attention methods designed for ARs largely incompatible with DLMs, as they fail to capture head-specific structures and risk degrading generation when applied in early denoising steps. To address these challenges, we propose SparseD, a novel sparse attention method for DLMs. Leveraging the observations, SparseD only requires pre-computing head-specific sparse patterns one time, and reuses them across all steps. This prevents recomputing sparse patterns at each denoising step. Meanwhile, SparseD uses full attention in the early steps, then switches to sparse attention later to maintain generation quality. Together, these establish SparseD as a practical and efficient solution for deploying DLMs in long-context applications. Experimental results demonstrate that SparseD achieves lossless acceleration, delivering up to $1.50\times$ speedup over FlashAttention at a 64k context length with 1,024 denoising steps.
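The one-time, head-specific pattern precomputation described in the abstract could be sketched roughly as follows. This is an illustrative assumption, not the paper's exact algorithm: the function name, the per-query top-k selection rule, and the `keep_ratio` parameter are all hypothetical stand-ins for however SparseD actually derives its masks.

```python
import numpy as np

def precompute_head_patterns(q, k, keep_ratio=0.1):
    """Hypothetical sketch: score all query-key pairs once per head
    and keep each query's strongest keys as a boolean sparse mask.
    Per the paper's observation that patterns are stable across
    denoising steps, the masks would then be reused at every step."""
    num_heads, seq_len, head_dim = q.shape
    # (num_heads, seq_len, seq_len) attention scores, computed once
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    masks = np.zeros_like(scores, dtype=bool)
    n_keep = max(1, int(keep_ratio * seq_len))
    for h in range(num_heads):
        # each head gets its own pattern: per query row, keep the
        # n_keep highest-scoring key positions
        idx = np.argpartition(scores[h], -n_keep, axis=-1)[:, -n_keep:]
        np.put_along_axis(masks[h], idx, True, axis=-1)
    return masks
```

Because each head selects its own top entries independently, the resulting masks differ across heads, which matches the paper's observation (1) that sparse patterns are head-specific.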
Problem

Research questions and friction points this paper is trying to address.

Reducing high inference latency in diffusion language models
Addressing quadratic complexity of attention in DLMs
Developing sparse attention compatible with DLM characteristics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pre-computes head-specific sparse patterns once
Uses full attention in early denoising steps
Switches to sparse attention in later steps
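Taken together, the three points above amount to a simple per-step schedule: full attention while generation quality is being established, then the precomputed head-specific sparse patterns for the remainder. A minimal sketch, assuming a hypothetical `full_frac` switch point (the paper does not specify its exact fraction here):

```python
def staged_attention_schedule(total_steps, full_frac=0.25):
    """Hypothetical sketch of SparseD-style scheduling: the first
    full_frac of denoising steps use full attention (the early,
    quality-critical steps), and all later steps switch to the
    precomputed head-specific sparse patterns."""
    switch_step = int(full_frac * total_steps)
    return ["full" if t < switch_step else "sparse"
            for t in range(total_steps)]
```

A denoising loop would consult this schedule at each step, running FlashAttention on all query-key pairs during the `"full"` steps and restricting computation to the reused sparse masks during the `"sparse"` steps.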