Uncovering the Role of Initial Saliency in U-Shaped Attention Bias: Scaling Initial Token Weight for Enhanced Long-Text Processing

📅 2025-12-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) suffer from the “lost-in-the-middle” phenomenon in long-context modeling, characterized by a U-shaped attention bias—excessive focus on initial and final tokens while neglecting middle segments. Prior work primarily attributes this to positional encoding artifacts. This paper identifies, for the first time, a novel mechanism: *initial saliency*—the disproportionately high attention weights assigned to the initial token propagate and amplify attention to subsequent semantically related tokens, thereby exacerbating attenuation in the middle. To address this, we propose a learnable initial-token weight scaling method that explicitly decouples and regulates this bias source. We further integrate it with positional encoding correction via joint optimization. On the MDQA benchmark, our approach improves performance by 3.6%; in KV-Retrieval, joint optimization yields a 3.4% gain. These results demonstrate substantial enhancement in long-context modeling capability.

📝 Abstract
Large language models (LLMs) have demonstrated strong performance on a variety of natural language processing (NLP) tasks. However, they often struggle with long-text sequences due to the "lost in the middle" phenomenon. This issue has been shown to arise from a U-shaped attention bias, in which attention is disproportionately focused on the beginning and end of a text, leaving the middle section underrepresented. While previous studies have attributed this bias to position encoding, our research is the first to identify an additional factor: initial saliency. That is, in the attention computation for each token, tokens that receive higher attention weights relative to the initial token tend to receive more attention when predicting the next token. We further find that exploiting this property by scaling the attention weight between the initial token and all other tokens improves the model's ability to process long contexts, achieving a maximum improvement of 3.6% on the MDQA dataset. Moreover, combining this approach with existing methods that reduce position encoding bias further enhances performance, achieving a maximum improvement of 3.4% on KV-Retrieval tasks.
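The core intervention described in the abstract, rescaling the attention weight between the initial token and all others, can be sketched in a few lines. This is a minimal single-head illustration, not the paper's implementation: the function name and the fixed `init_scale` constant are assumptions (the paper makes this scaling factor learnable and jointly optimizes it with positional encoding correction).

```python
import torch
import torch.nn.functional as F

def attention_with_initial_scaling(q, k, v, init_scale=0.5):
    """Single-head causal attention that rescales the attention
    weight placed on the initial token (position 0).

    q, k, v: (seq_len, d) tensors. init_scale is a hypothetical
    fixed knob here; the paper learns this parameter instead.
    """
    d = q.size(-1)
    scores = q @ k.t() / d ** 0.5
    # Causal mask: position i may attend only to positions <= i.
    mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    # Rescale the weight assigned to the initial token, then
    # renormalize so each row still sums to 1. Damping this weight
    # is meant to curb the "initial saliency" effect, freeing
    # attention mass for middle-of-context tokens.
    weights = weights.clone()
    weights[:, 0] = weights[:, 0] * init_scale
    weights = weights / weights.sum(dim=-1, keepdim=True)
    return weights @ v
```

In a real model this rescaling would be applied per head inside each attention layer, with `init_scale` treated as a trainable parameter.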
Problem

Research questions and friction points this paper is trying to address.

Identifies initial saliency as a factor in U-shaped attention bias
Proposes scaling initial token weight to enhance long-text processing
Combines with existing methods to further improve model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scaling initial token attention weight
Addressing U-shaped attention bias
Combining with position encoding methods