🤖 AI Summary
LLMs exhibit significant position bias, often called “lost in the middle”, in long-context tasks: performance degrades markedly when critical information sits in the middle of the sequence. This work identifies the causal attention mask, in addition to position embeddings, as a contributor to this bias, showing that the mask creates position-specific hidden states and that attention weights are a micro-level expression of the bias. To address this, we propose PosScale, a lightweight, plug-and-play mechanism that applies linear rescaling to a single dimension of the hidden states and requires no architectural modification, retraining, or changes to positional encodings. PosScale is compatible with RoPE, Alibi, and context-window-extended models. Empirically, it improves performance by up to 15.2% on benchmarks including NaturalQuestions multi-document QA, KV retrieval, LongBench, and timeline reordering, demonstrating generalization across models and tasks. The implementation is publicly available.
📝 Abstract
Large Language Models (LLMs) are increasingly applied in real-world scenarios thanks to their strong generalization and generative abilities. However, they exhibit position bias, also known as "lost in the middle", a phenomenon that is especially pronounced in long-context scenarios: where the key information is placed in a prompt can significantly affect accuracy. This paper first explores the micro-level manifestations of position bias, concluding that attention weights are a micro-level expression of it. It further identifies that, in addition to position embeddings, the causal attention mask also contributes to position bias by creating position-specific hidden states. Based on these insights, we propose a method to mitigate position bias by scaling these positional hidden states. Experiments on the NaturalQuestions multi-document QA, KV retrieval, LongBench, and timeline reorder tasks, using various models including RoPE models, context-window-extended models, and Alibi models, demonstrate the effectiveness and generalizability of our approach. Our method can improve performance by up to 15.2% by modifying just one dimension of hidden states. Our code is available at https://aka.ms/PositionalHidden.
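To make the two ideas in the abstract concrete, here is a toy sketch in plain Python. It is not the authors' implementation. The helper names (`causal_uniform_attention`, `scale_dimension`), the toy input, the choice of dimension 0, and the factor 0.5 are all assumptions made for illustration; the abstract does not say how the positional dimension or the scaling factor is chosen. The first function shows how a causal mask alone, with no position embeddings, can make a hidden-state dimension depend on position. The second shows what "scaling one dimension of hidden states" looks like mechanically.

```python
def causal_uniform_attention(values):
    """Attention with all-equal scores under a causal mask.

    Position i may only attend to positions 0..i, so with equal scores the
    softmax weights are 1/(i+1) each and the output is the prefix mean.
    """
    outputs = []
    for i in range(len(values)):
        prefix = values[: i + 1]  # causal mask: later tokens are hidden
        dim = len(prefix[0])
        outputs.append(
            [sum(v[d] for v in prefix) / len(prefix) for d in range(dim)]
        )
    return outputs


def scale_dimension(hidden_states, dim, factor):
    """Return a copy of hidden_states with one dimension multiplied by factor."""
    scaled = []
    for h in hidden_states:
        h = list(h)  # copy so the input is left unchanged
        h[dim] *= factor
        scaled.append(h)
    return scaled


# Toy input with no position embeddings (hypothetical values): token 0 is a
# start marker carrying 1.0 in dimension 0; all later tokens are identical.
values = [[1.0, 0.5]] + [[0.0, 0.5] for _ in range(3)]

hidden = causal_uniform_attention(values)
# Dimension 0 falls off as 1/(i+1) with position i; dimension 1 stays constant.

# Hypothetical choice: treat dimension 0 as the positional one and halve it.
damped = scale_dimension(hidden, dim=0, factor=0.5)
```

In this toy setting, dimension 0 of `hidden` varies with position even though the inputs carry no positional signal, which is the kind of position-specific hidden state the abstract attributes to the causal mask. Rescaling that single dimension changes how strongly the signal is expressed while leaving every other dimension untouched.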