🤖 AI Summary
Standard Transformers suffer from computational redundancy in long-context modeling due to dense self-attention. Method: This paper reframes sequence modeling as a supervised token-importance identification task and formulates attention optimization as a linear coding problem. The authors propose Dynamic Group Attention (DGA), a sparse attention mechanism that identifies critical token subsets, aggregates less important tokens into groups, and scores token importance dynamically. Contribution/Results: Theoretical analysis establishes DGA's robustness to random noise and its learning efficiency. Empirical evaluation demonstrates substantial reductions in computational cost while maintaining performance comparable to standard Transformers across multiple long-text benchmarks. DGA thus offers a new paradigm for efficient long-context modeling.
📝 Abstract
Transformer-based large language models (LLMs) excel in natural language processing tasks by capturing long-range dependencies through self-attention mechanisms. However, long-context modeling faces significant computational inefficiencies due to *redundant* attention computations: while attention weights are often *sparse*, all tokens consume *equal* computational resources. In this paper, we reformulate traditional probabilistic sequence modeling as a *supervised learning task*, enabling the separation of relevant and irrelevant tokens and providing a clearer understanding of redundancy. Based on this reformulation, we theoretically analyze attention sparsity, revealing that only a few tokens significantly contribute to predictions. Building on this, we formulate attention optimization as a linear coding problem and propose a *group coding strategy*, theoretically showing its ability to improve robustness against random noise and enhance learning efficiency. Motivated by this, we propose *Dynamic Group Attention* (DGA), which leverages the group coding to explicitly reduce redundancy by aggregating less important tokens during attention computation. Empirical results show that our DGA significantly reduces computational costs while maintaining competitive performance. Code is available at https://github.com/bolixinyu/DynamicGroupAttention.
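To make the core idea concrete, here is a minimal NumPy sketch of attention with grouped aggregation: the most important key tokens are attended to individually, while the rest are merged into a few averaged "group" tokens before the softmax. The importance criterion (max attention logit), the chunked-mean grouping, and all names here are illustrative assumptions, not the paper's actual DGA implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def grouped_attention_sketch(q, k, v, n_keep=4, n_groups=2):
    """Attend to the n_keep most 'important' tokens individually and
    aggregate the remaining tokens into n_groups averaged group tokens.
    Illustrative only; the real DGA scoring/grouping differs."""
    d = q.shape[-1]
    # Assumed importance proxy: the max attention logit each key receives.
    scores = (q @ k.T / np.sqrt(d)).max(axis=0)      # (n_tokens,)
    order = np.argsort(-scores)
    keep, rest = order[:n_keep], order[n_keep:]
    # Aggregate less important tokens into groups (simple chunked mean).
    chunks = [g for g in np.array_split(rest, n_groups) if len(g)]
    k_red = np.concatenate([k[keep], np.stack([k[g].mean(0) for g in chunks])])
    v_red = np.concatenate([v[keep], np.stack([v[g].mean(0) for g in chunks])])
    # Attention over the reduced set: n_keep + n_groups keys instead of all.
    attn = softmax(q @ k_red.T / np.sqrt(d))
    return attn @ v_red                              # (n_queries, d)

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 16)) for _ in range(3))
out = grouped_attention_sketch(q, k, v)
print(out.shape)  # (8, 16)
```

With 8 tokens, 4 kept and 2 groups, each query attends over 6 reduced keys rather than 8; at long context lengths this gap is what drives the claimed cost reduction.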