🤖 AI Summary
This work identifies a positional bias in long-context large language models, termed "multi-relevant-span distance bias," in which performance degrades as the positional distance between multiple relevant information spans grows. To study this phenomenon systematically, we introduce LongPiBench, the first benchmark to evaluate positional bias over multiple relevant spans, and assess 11 state-of-the-art models with it. This moves beyond conventional single-span bias analyses by quantitatively characterizing a distance-dependent bias. Experimental results show that although most models have mitigated the "lost in the middle" issue, they remain highly sensitive to inter-span distance, a bias observed consistently across both commercial and open-source models. The work thus offers a new analytical lens for long-context modeling and a reproducible, span-aware evaluation framework for research on context-length scaling and positional generalization.
📝 Abstract
Positional bias in large language models (LLMs) hinders their ability to effectively process long inputs. A prominent example is the "lost in the middle" phenomenon, where LLMs struggle to utilize relevant information situated in the middle of the input. While prior research primarily focuses on single pieces of relevant information, real-world applications often involve multiple relevant information pieces. To bridge this gap, we present LongPiBench, a benchmark designed to assess positional bias involving multiple pieces of relevant information. Thorough experiments are conducted with five commercial and six open-source models. These experiments reveal that while most current models are robust against the "lost in the middle" issue, there exist significant biases related to the spacing of relevant information pieces. These findings highlight the importance of evaluating and reducing positional biases to advance LLMs' capabilities.
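To make the evaluated setup concrete, here is a minimal sketch (not the LongPiBench implementation; the filler text, function name, and parameters are illustrative assumptions) of how a synthetic prompt with a controllable inter-span distance could be constructed, so that retrieval accuracy can be measured as a function of how far apart two relevant spans sit:

```python
# Illustrative sketch only -- not the LongPiBench code. It builds a prompt
# in which two relevant "needle" sentences are separated by a controllable
# number of filler sentences, keeping the pair roughly centered.

FILLER = "The sky was clear over the quiet town that day."

def build_prompt(needle_a: str, needle_b: str, distance: int, total: int) -> str:
    """Place two needles `distance` filler sentences apart inside a
    context of `total` filler sentences."""
    assert distance + 2 <= total, "context too short for this distance"
    sentences = [FILLER] * total
    start = (total - distance) // 2          # center the needle pair
    sentences.insert(start, needle_a)        # first relevant span
    sentences.insert(start + distance + 1, needle_b)  # second span, `distance` fillers later
    return " ".join(sentences)

# Sweeping `distance` while holding `total` fixed isolates the
# inter-span-distance variable that the abstract describes.
prompt = build_prompt(
    "Fact A: the access code is 4217.",
    "Fact B: the server room is on floor 9.",
    distance=5,
    total=20,
)
```

Querying a model over prompts generated for a range of `distance` values, while asking it to combine both facts, would surface exactly the spacing-dependent bias reported above.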