Core Context Aware Attention for Long Context Language Modeling

📅 2024-12-17
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational cost of self-attention and the degraded modeling of critical long-range dependencies in large language models (LLMs) under long-context settings (e.g., 128K tokens), this paper proposes a plug-and-play Core Context Aware (CCA) attention mechanism. Methodologically, it introduces a dual-path architecture integrating global pooling and local preservation: the global path suppresses redundancy via dynamic saliency-weighted token aggregation, while the local path retains fine-grained structural information through attention over neighboring tokens. This design reduces both computational and memory complexity, scaling subquadratically with sequence length, while improving the accuracy of long-range dependency modeling. Extensive experiments demonstrate that the method outperforms state-of-the-art approaches across multiple long-context benchmarks, achieving a superior trade-off between inference efficiency and modeling performance.

📝 Abstract
Transformer-based Large Language Models (LLMs) have exhibited remarkable success in various natural language processing tasks, primarily attributed to the self-attention mechanism, which requires a token to consider all preceding tokens as its context when computing the attention score. However, when the context length L becomes very large (e.g., 32K), more redundant context information is included w.r.t. any given token, causing self-attention to suffer from two main limitations: 1) the computational and memory complexity scales quadratically w.r.t. L; 2) the presence of redundant context information may hamper the model's ability to capture dependencies among crucial tokens, which may degrade representation performance. In this paper, we propose a plug-and-play Core Context Aware (CCA) Attention for efficient long-range context modeling, which consists of two components: 1) globality-pooling attention, which divides input tokens into groups and then dynamically merges the tokens within each group into one core token based on their significance; 2) locality-preserved attention, which incorporates neighboring tokens into the attention calculation. The two complementary attentions are then fused into the final attention, maintaining modeling ability comparable to full self-attention. In this way, the core context information w.r.t. a given token is automatically focused on and strengthened, while the context information in redundant groups is diminished during the learning process. As a result, the computational and memory complexity is significantly reduced. More importantly, CCA-Attention can improve long-context modeling ability by diminishing redundant context information. Extensive experimental results demonstrate that our CCA-Attention significantly outperforms state-of-the-art models in terms of computational efficiency and long-context modeling ability.
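
The two-component design described in the abstract can be sketched as follows. This is a minimal single-head NumPy illustration, not the paper's implementation: the significance score, the fusion rule (a plain average here; the paper learns the fusion), and the function names are all assumptions, and causal masking on the global path is omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cca_attention(X, group_size=4, window=4):
    """Sketch of CCA-style attention. X: (L, d) token embeddings -> (L, d)."""
    L, d = X.shape
    scale = 1.0 / np.sqrt(d)

    # 1) Globality-pooling: merge each group into one "core token",
    #    weighting tokens by a significance proxy (softmax over a
    #    per-token score within the group; the paper learns this).
    n_groups = L // group_size
    groups = X[: n_groups * group_size].reshape(n_groups, group_size, d)
    sig = softmax(groups.mean(axis=-1), axis=-1)   # (G, g) significance weights
    core = (sig[..., None] * groups).sum(axis=1)   # (G, d) core tokens

    # 2) Global path: every token attends over the G compressed core
    #    tokens, so this path costs O(L * G) rather than O(L^2).
    global_out = softmax(X @ core.T * scale, axis=-1) @ core

    # 3) Locality-preserved path: causal attention over a small
    #    neighborhood window, retaining fine-grained local context.
    local_out = np.zeros_like(X)
    for i in range(L):
        nbrs = X[max(0, i - window) : i + 1]       # causal local window
        w = softmax(X[i] @ nbrs.T * scale)
        local_out[i] = w @ nbrs

    # 4) Fuse the two complementary paths (simple average for the sketch).
    return 0.5 * (global_out + local_out)

# Usage: 16 tokens of dimension 8, pooled into groups of 4.
X = np.random.default_rng(0).normal(size=(16, 8))
out = cca_attention(X, group_size=4, window=3)
```

With L tokens and G = L / group_size core tokens, the global path replaces the L x L score matrix with an L x G one, which is where the subquadratic complexity claimed in the abstract comes from.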
Problem

Research questions and friction points this paper is trying to address.

Redundant context in long sequences hampers performance and efficiency
Proposes Core Context Aware Attention for efficient long-context modeling
Dynamically compresses tokens to focus on core context and reduce redundancy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Core Context Aware (CCA) Attention for long-context modeling
Globality-pooling attention dynamically compresses token groups into core tokens
Locality-preserved attention maintains fine-grained local context