🤖 AI Summary
Standard Transformers suffer from computational redundancy in long-context modeling due to dense self-attention. Method: This paper reframes sequence modeling as a supervised token-importance identification task and formulates attention optimization as a linear coding problem. The authors propose Dynamic Group Attention (DGA), a sparse attention mechanism that identifies critical token subsets, aggregates less important tokens into groups, and scores token importance dynamically. Contribution/Results: Theoretical analysis establishes DGA's robustness to random noise and its learning efficiency. Empirical evaluation demonstrates substantial reductions in computational cost while maintaining performance comparable to standard Transformers across multiple long-text benchmarks. DGA thus offers a new paradigm for efficient long-context modeling.
📝 Abstract
Transformer-based large language models (LLMs) excel in natural language processing tasks by capturing long-range dependencies through self-attention mechanisms. However, long-context modeling faces significant computational inefficiencies due to *redundant* attention computations: while attention weights are often *sparse*, all tokens consume *equal* computational resources. In this paper, we reformulate traditional probabilistic sequence modeling as a *supervised learning task*, enabling the separation of relevant and irrelevant tokens and providing a clearer understanding of redundancy. Based on this reformulation, we theoretically analyze attention sparsity, revealing that only a few tokens significantly contribute to predictions. Building on this, we formulate attention optimization as a linear coding problem and propose a *group coding strategy*, theoretically showing its ability to improve robustness against random noise and enhance learning efficiency. Motivated by this, we propose *Dynamic Group Attention* (DGA), which leverages the group coding to explicitly reduce redundancy by aggregating less important tokens during attention computation. Empirical results show that our DGA significantly reduces computational costs while maintaining competitive performance. Code is available at https://github.com/bolixinyu/DynamicGroupAttention.
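To make the core idea concrete, here is a minimal NumPy sketch of attention with grouped aggregation: the most important key tokens are attended to individually, while the rest are merged into a few averaged "group" tokens before the softmax. The importance criterion (max attention logit), the chunked-mean grouping, and all names here are illustrative assumptions, not the paper's actual DGA implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def grouped_attention_sketch(q, k, v, n_keep=4, n_groups=2):
    """Attend to the n_keep most 'important' tokens individually and
    aggregate the remaining tokens into n_groups averaged group tokens.
    Illustrative only; the real DGA scoring/grouping differs."""
    d = q.shape[-1]
    # Assumed importance proxy: the max attention logit each key receives.
    scores = (q @ k.T / np.sqrt(d)).max(axis=0)      # (n_tokens,)
    order = np.argsort(-scores)
    keep, rest = order[:n_keep], order[n_keep:]
    # Aggregate less important tokens into groups (simple chunked mean).
    chunks = [g for g in np.array_split(rest, n_groups) if len(g)]
    k_red = np.concatenate([k[keep], np.stack([k[g].mean(0) for g in chunks])])
    v_red = np.concatenate([v[keep], np.stack([v[g].mean(0) for g in chunks])])
    # Attention over the reduced set: n_keep + n_groups keys instead of all.
    attn = softmax(q @ k_red.T / np.sqrt(d))
    return attn @ v_red                              # (n_queries, d)

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 16)) for _ in range(3))
out = grouped_attention_sketch(q, k, v)
print(out.shape)  # (8, 16)
```

With 8 tokens, 4 kept and 2 groups, each query attends over 6 reduced keys rather than 8; at long context lengths this gap is what drives the claimed cost reduction.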