Lost in the Evidence? Reproducing Document Position and Context Size Effects in RAG

📅 2026-05-26

🤖 AI Summary

This study addresses the inconsistent and often irreproducible findings in the literature on how retrieved-document ordering and context length affect Retrieval-Augmented Generation (RAG) systems. Under a controlled evaluation framework, we systematically examine positional effects in large language models, such as the “lost in the middle” phenomenon, and analyze how ranking strategies and context length influence performance in realistic RAG settings. We propose a topic-calibration method based on repeated subset sampling that reduces variance and stabilizes trend estimation. The study combines ablations, sampling across multiple topic budgets, comparisons between LLM and human evaluations, and joint analysis with retrieval quality. Together, these show that conclusions drawn from idealized experimental setups do not necessarily generalize to real-world RAG systems, because positional sensitivity depends strongly on both retrieval quality and model choice. This highlights the need for more robust RAG evaluation paradigms.
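As an illustration of how a positional effect such as “lost in the middle” is typically probed, the sketch below places one relevant document at every possible slot among a fixed list of distractors and builds one prompt per slot. The function names (`place_gold`, `build_prompt`, `position_sweep`) and the prompt template are hypothetical and are not taken from the paper's released code.

```python
from typing import Dict, List


def place_gold(distractors: List[str], gold: str, position: int) -> List[str]:
    """Return a new document list with `gold` inserted at index `position`."""
    if not 0 <= position <= len(distractors):
        raise ValueError("position must be between 0 and len(distractors)")
    return distractors[:position] + [gold] + distractors[position:]


def build_prompt(question: str, documents: List[str]) -> str:
    """Concatenate numbered documents and the question (assumed template)."""
    numbered = [f"Document [{i + 1}]: {doc}" for i, doc in enumerate(documents)]
    return "\n".join(numbered) + f"\n\nQuestion: {question}\nAnswer:"


def position_sweep(question: str, gold: str, distractors: List[str]) -> Dict[int, str]:
    """Build one prompt per possible position of the gold document."""
    return {
        pos: build_prompt(question, place_gold(distractors, gold, pos))
        for pos in range(len(distractors) + 1)
    }


if __name__ == "__main__":
    prompts = position_sweep(
        question="Who wrote the report?",
        gold="The report was written by the audit team.",
        distractors=["Unrelated text A.", "Unrelated text B."],
    )
    for pos, prompt in prompts.items():
        print(f"--- gold at position {pos} ---\n{prompt}\n")
```

Scoring the model's answer for each prompt and plotting accuracy against position would then show whether performance dips when the relevant document sits in the middle of the context.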

📝 Abstract

Retrieval-Augmented Generation (RAG) systems rely on retrieved documents being concatenated into a model's input context, making both document ordering and context size critical yet controversial design choices. Prior work reports position-based effects such as lost in the middle and related long-context phenomena. However, empirical findings remain inconsistent and hard to reproduce across models, datasets, and evaluation protocols. In this paper, we present a systematic reproducibility study that revisits these claims and examines how they evolve with contemporary LLMs under a controlled evaluation framework. We first show that topic sampling is a major source of variance: small topic sets can mask or exaggerate ordering effects. Based on repeated subset sampling across multiple topic budgets, we provide a practical calibration procedure that identifies topic counts yielding stable trends at feasible cost. Using these fixed topic sets, we then reproduce and extend results on position sensitivity, re-evaluating lost in the middle and positional biases in modern LLMs. We also study a more realistic RAG scenario in which relevance is mediated by a retriever rather than oracle access to ground-truth documents. In this setting, we re-examine a recent industry study and trace discrepancies to evaluation choices such as limited topic coverage and reliance on LLM-based judges. Finally, we analyze how retrieval order and context size affect downstream LLM performance under imperfect retrieval. Our results demonstrate that both factors interact strongly with retrieval quality and model choice, and that conclusions drawn from idealised setups do not always transfer to real-world RAG pipelines. We release all code and configurations to support reproducibility and future work on robust RAG evaluation.
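The calibration idea of repeated subset sampling across topic budgets can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' released procedure: `per_topic_delta` is a hypothetical mapping from topic ID to the score difference between two document orderings, and the stability criterion (sign agreement with the full-set mean) and the 0.95 threshold are choices made here for the example.

```python
import random
import statistics
from typing import Dict, Iterable, Optional


def subset_stability(
    per_topic_delta: Dict[str, float],
    budgets: Iterable[int],
    n_repeats: int = 200,
    seed: int = 0,
) -> Dict[int, Dict[str, float]]:
    """For each topic budget, repeatedly sample topic subsets and summarize
    how much the estimated ordering effect varies across subsets."""
    rng = random.Random(seed)
    topics = sorted(per_topic_delta)
    full_mean = statistics.fmean(per_topic_delta.values())
    report = {}
    for budget in budgets:
        means = []
        for _ in range(n_repeats):
            sample = rng.sample(topics, budget)  # draw without replacement
            means.append(statistics.fmean(per_topic_delta[t] for t in sample))
        # Fraction of subsets whose effect has the same sign as the full set.
        agree = sum((m > 0) == (full_mean > 0) for m in means) / n_repeats
        report[budget] = {
            "mean": statistics.fmean(means),
            "stdev": statistics.pstdev(means),
            "sign_agreement": agree,
        }
    return report


def smallest_stable_budget(
    report: Dict[int, Dict[str, float]], min_agreement: float = 0.95
) -> Optional[int]:
    """Return the smallest budget meeting the (assumed) agreement threshold."""
    for budget in sorted(report):
        if report[budget]["sign_agreement"] >= min_agreement:
            return budget
    return None


if __name__ == "__main__":
    # Synthetic per-topic effects, for illustration only.
    demo_rng = random.Random(1)
    deltas = {f"topic{i}": demo_rng.gauss(0.02, 0.1) for i in range(200)}
    result = subset_stability(deltas, budgets=[10, 25, 50, 100, 200])
    for budget, stats in result.items():
        print(budget, stats)
    print("smallest stable budget:", smallest_stable_budget(result))
```

With small budgets the subset means spread widely and can even flip sign, which is how a small topic set can mask or exaggerate an ordering effect. The spread shrinks as the budget approaches the full topic set.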

Problem

Research questions and friction points this paper is trying to address.

Retrieval-Augmented Generation

position effects

context size

reproducibility

lost in the middle

Innovation

Methods, ideas, or system contributions that make the work stand out.

reproducibility

RAG

position bias