DAM: Dynamic Attention Mask for Long-Context Large Language Model Inference Acceleration

πŸ“… 2025-06-06
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Long-context modeling is hindered by the quadratic computational complexity of Transformer self-attention; existing sparse attention methods rely on static masks, limiting adaptability to heterogeneous semantic patterns and resulting in suboptimal token interactions and inaccurate retrieval. This paper proposes a dynamic sparse attention mechanism that adaptively generates soft masks directly from attention mapsβ€”without fine-tuning or predefined structural constraints. We introduce a layer- and head-granular, context-aware mask generator, integrating a lightweight encoder with attention-map-level gating to enable zero-shot modeling of heterogeneous patterns. Evaluated on long-document question answering and retrieval tasks, our method achieves performance within 1% of full attention while reducing GPU memory consumption by 42% and inference latency by 38%, enabling efficient deployment on million-token contexts.

Technology Category

Application Category

πŸ“ Abstract
Long-context understanding is crucial for many NLP applications, yet transformers struggle with efficiency due to the quadratic complexity of self-attention. Sparse attention methods alleviate this cost but often impose static, predefined masks, failing to capture heterogeneous attention patterns. This results in suboptimal token interactions, limiting adaptability and retrieval accuracy in long-sequence tasks. This work introduces a dynamic sparse attention mechanism that assigns adaptive masks at the attention-map level, preserving heterogeneous patterns across layers and heads. Unlike existing approaches, our method eliminates the need for fine-tuning and predefined mask structures while maintaining computational efficiency. By learning context-aware attention structures, it achieves high alignment with full-attention models, ensuring minimal performance degradation while reducing memory and compute overhead. This approach provides a scalable alternative to full attention, enabling the practical deployment of large-scale Large Language Models (LLMs) without sacrificing retrieval performance. DAM is available at: https://github.com/HanzhiZhang-Ulrica/DAM.
Problem

Research questions and friction points this paper is trying to address.

Reduces quadratic complexity in long-context transformer self-attention
Replaces static sparse masks with dynamic adaptive attention patterns
Maintains retrieval accuracy while cutting memory and compute costs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic sparse attention mechanism for efficiency
Adaptive masks without fine-tuning or predefined structures
Maintains performance while reducing memory and compute
πŸ”Ž Similar Papers
No similar papers found.