Efficient Long-Context Modeling in Diffusion Language Models via Block Approximate Sparse Attention

📅 2026-05-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Diffusion language models incur substantial computational costs when processing ultra-long sequences. Existing block-sparse attention mechanisms select blocks with fixed sampling patterns, which makes them prone to losing critical information and sensitive to distribution shifts. This work proposes Block Approximate Sparse Attention (BA-Att), a framework that identifies high-information regions through block-level pre-downsampling in a compact space, avoiding reliance on fragile positional priors. BA-Att adds a lightweight norm-sorting module and a covariance-compensated correction to approximate full attention behavior efficiently, with a theoretical error analysis guiding the design. At 50% sparsity, BA-Att maintains performance close to full attention across language models, multimodal language models, and video generation models, and achieves up to a 6.95× speedup over FlashAttention in attention computation.

📝 Abstract

Diffusion Language Models (DLMs) enable globally coherent, bidirectional, and controllable text generation, offering advantages over traditional autoregressive LLMs, but scaling them to ultra-long sequences remains costly. Many existing block-sparse attention methods select blocks by fixed sampling patterns over the high-resolution attention space, such as tail regions or anti-diagonal stripes. Such prior-driven sampling can miss salient tokens and introduce instability under distribution shifts. In this paper, we propose the Block Approximate Sparse Attention framework (BA-Att), which uses a block-wise pre-downsampling operation to identify informative regions within a compact downsampled space, avoiding reliance on brittle positional priors. To analyze its theoretical behavior, we define an oracle post-downsample attention map and formalize the approximation error between the pre- and post-downsample schemes. Based on this analysis, we introduce a lightweight norm-sorting module and a covariance-compensated correction that approximates the full covariance using diagonal QK variances, reducing computational complexity. Extensive experiments show that our operator achieves up to 6.95× acceleration over FlashAttention in attention computation and maintains near full-attention performance at 50% sparsity across language models, multimodal language models, and video generation models, demonstrating strong efficiency and generalization.
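The general idea of choosing blocks in a downsampled space can be illustrated with a small NumPy sketch. This is not the paper's operator: the mean pooling, the per-row top-k rule, and the names `block_sparse_attention`, `block`, and `keep_ratio` are all assumptions made for illustration, and the norm-sorting module and covariance-compensated correction described in the abstract are not reproduced here.

```python
import numpy as np

def block_sparse_attention(q, k, v, block=4, keep_ratio=0.5):
    """Toy block selection in a pooled space, then masked attention.

    Hypothetical illustration only; not the BA-Att operator.
    """
    n, d = q.shape
    assert n % block == 0, "sequence length must be a multiple of block"
    nb = n // block

    # Pre-downsample: mean-pool queries and keys inside each block
    # (mean pooling is an assumption of this sketch).
    q_blk = q.reshape(nb, block, d).mean(axis=1)
    k_blk = k.reshape(nb, block, d).mean(axis=1)

    # Score every (query block, key block) pair in the compact space.
    blk_scores = q_blk @ k_blk.T / np.sqrt(d)

    # Keep the highest-scoring key blocks for each query block.
    n_keep = max(1, int(round(keep_ratio * nb)))
    top = np.argsort(-blk_scores, axis=1)[:, :n_keep]
    blk_mask = np.zeros((nb, nb), dtype=bool)
    np.put_along_axis(blk_mask, top, True, axis=1)

    # Expand the block mask to token resolution and run masked softmax.
    mask = np.repeat(np.repeat(blk_mask, block, axis=0), block, axis=1)
    logits = np.where(mask, q @ k.T / np.sqrt(d), -np.inf)
    logits -= logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v, blk_mask

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
out, blk_mask = block_sparse_attention(q, k, v, block=4, keep_ratio=0.5)
```

With `keep_ratio=0.5`, each query block attends to half of the key blocks. The sketch still builds the dense logit matrix, so it shows only the selection logic; any speedup would come from a kernel that skips the masked blocks entirely.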

Problem

Research questions and friction points this paper is trying to address.

Diffusion Language Models

Long-Context Modeling

Sparse Attention

Block-Sparse Attention

Computational Efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Block Approximate Sparse Attention

Diffusion Language Models

Long-Context Modeling