Trainable Dynamic Mask Sparse Attention

📅 2025-08-04
🤖 AI Summary
To address the quadratic computational complexity of self-attention in large language models (LLMs) for long-context modeling, this paper proposes a dynamic sparse attention mechanism. The method introduces a learnable sparse mask jointly conditioned on content and position: it dynamically generates sparsity patterns from value representations while incorporating a sliding window to enforce position-aware sparse computation, and is fully compatible with multi-head and multi-query attention architectures. Unlike static sparsity or fixed-window approaches, the mechanism adaptively concentrates computation on salient tokens and eliminates redundant operations, balancing efficiency and modeling fidelity. Under Chinchilla scaling law settings, a 1.7B-parameter model achieves significant improvements in both accuracy and throughput over multi-head attention, sliding-window attention, and state-of-the-art sparse attention baselines, across standard long-context benchmarks and the "needle-in-a-haystack" long-range reasoning task.

📝 Abstract
In large language models, the demand for modeling long contexts is constantly increasing, but the quadratic complexity of the standard self-attention mechanism often becomes a bottleneck. Although existing sparse attention mechanisms have improved efficiency, they may still suffer from static patterns or information loss. We introduce a trainable dynamic mask sparse attention mechanism, Dynamic Mask Attention (DMA), which effectively exploits content-aware and position-aware sparsity. DMA achieves this through two key innovations: first, it dynamically generates content-aware sparse masks from value representations, enabling the model to identify and adaptively focus on critical information; second, it implements position-aware sparse attention computation that effectively skips unnecessary calculation regions. This dual-sparsity design allows the model to significantly reduce computational cost while retaining complete information, achieving an excellent balance between information fidelity and computational efficiency. We have verified the performance of DMA through comprehensive experiments. Comparative studies show that DMA outperforms multi-head attention, sliding-window attention, multi-head latent attention, and native sparse attention in terms of perplexity under Chinchilla scaling law settings. Moreover, in challenging multi-query associative recall tasks, DMA also demonstrates superior performance and efficiency compared to these methods. Crucially, in the evaluation of a 1.7B-parameter model, DMA significantly outperforms multi-head attention in both standard benchmark performance and the challenging needle-in-a-haystack task. These results highlight its ability to balance model efficiency and long-context modeling capability.
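The dual-sparsity design described in the abstract (a content-aware mask derived from value representations, combined with a position-aware causal sliding window) can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the gate vector `w_gate`, the `window` and `keep` parameters, and the union rule for combining the two masks are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_mask_attention(q, k, v, w_gate, window=4, keep=4):
    """Sketch of dual-sparsity attention (not the paper's exact method).

    q, k, v: (T, d) arrays for a single head.
    w_gate:  (d,) hypothetical learned vector scoring value content.
    window:  causal sliding-window size (position-aware sparsity).
    keep:    number of globally salient tokens kept (content-aware sparsity).
    """
    T, d = q.shape
    # Content-aware mask: score each token from its value representation
    # and keep the top-`keep` tokens as globally attendable.
    gate = v @ w_gate                          # (T,)
    salient = np.argsort(-gate)[:keep]
    content_mask = np.zeros(T, dtype=bool)
    content_mask[salient] = True
    # Position-aware mask: causal sliding window.
    idx = np.arange(T)
    causal = idx[None, :] <= idx[:, None]
    pos_mask = causal & (idx[:, None] - idx[None, :] < window)
    # Union of the two sparsity patterns, restricted to causal positions.
    allowed = pos_mask | (content_mask[None, :] & causal)
    scores = (q @ k.T) / np.sqrt(d)
    scores = np.where(allowed, scores, -np.inf)  # skipped regions get no weight
    return softmax(scores, axis=-1) @ v
```

In an actual kernel the masked regions would be skipped entirely rather than filled with `-inf`, which is where the computational savings come from; the dense masking here only demonstrates the sparsity pattern.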
Problem

Research questions and friction points this paper is trying to address.

Reduces quadratic complexity in self-attention for long contexts
Addresses static patterns and information loss in sparse attention
Balances computational efficiency with information fidelity dynamically
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic content-aware sparse mask generation
Position-aware sparse attention computation
Balances information fidelity and efficiency
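As a rough illustration of the efficiency claim, the per-query work under a sliding window plus a fixed top-k budget can be counted directly. The `window` and `keep` values below are made-up, and overlap between the two masks is ignored, so the sparse count is an upper bound:

```python
# Rough cost count: dense attention vs. window + top-k sparse attention.
# window/keep values are illustrative, not from the paper.
T, window, keep = 4096, 128, 64
dense = T * T  # entries in the full score matrix
# Each query attends to its causal window plus a fixed salient-token budget.
sparse = sum(min(i + 1, window) + keep for i in range(T))
print(f"dense={dense}, sparse={sparse}, ratio={sparse / dense:.3f}")
```

The retained fraction is roughly `(window + keep) / T`, so the gap widens as context length grows, which matches the motivation of replacing quadratic attention for long contexts.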
Authors
Jingze Shi (SmallDoges)
Yifan Wu (SmallDoges)
Bingheng Wu (SmallDoges)
Yiran Peng (SmallDoges)
Liangdong Wang (BAAI)
Guang Liu (BAAI)
Yuyu Luo (Assistant Professor, HKUST(GZ) / HKUST)