Is Position Bias in Dense Retrievers Built In or Learned from Data?

📅 2026-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work investigates position bias in dense retrievers, which tend to favor documents whose relevant evidence appears near the beginning and to rank documents lower when the evidence appears later. To examine how the distribution of evidence positions in training data influences this bias, the authors construct synthetic training sets with controlled evidence placement (beginning, middle, or end) and fine-tune eight architecturally diverse pretrained models under position-skewed and position-balanced distributions. Skewed training shifts the bias toward the corresponding position, which identifies the positional distribution of evidence in training data as a major controllable factor in position bias. Position-balanced training reduces sensitivity to evidence position by 57–87% on position-aware benchmarks while keeping mean retrieval performance competitive in the authors' controlled setting.
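
The data construction described above can be pictured with a minimal sketch. It is not the authors' pipeline: the names `place_evidence` and `build_training_set`, the sentence-level insertion, and the round-robin balancing rule are all assumptions made for illustration.

```python
import random

POSITIONS = ("beginning", "middle", "end")


def place_evidence(evidence, fillers, position):
    """Insert the evidence sentence among filler sentences at a target position.

    Hypothetical helper: the summary does not say how documents are assembled.
    """
    if position == "beginning":
        index = 0
    elif position == "middle":
        index = len(fillers) // 2
    elif position == "end":
        index = len(fillers)
    else:
        raise ValueError(f"unknown position: {position!r}")
    sentences = fillers[:index] + [evidence] + fillers[index:]
    return " ".join(sentences)


def build_training_set(examples, mode, seed=0):
    """Build (query, positive document) rows under a positional distribution.

    `mode` is one of POSITIONS for a skewed set, or "balanced" to cycle
    through all three positions so each gets an equal share (an assumed
    balancing rule, chosen here for simplicity).
    """
    rng = random.Random(seed)
    rows = []
    for i, (query, evidence, fillers) in enumerate(examples):
        position = POSITIONS[i % len(POSITIONS)] if mode == "balanced" else mode
        rows.append(
            {
                "query": query,
                "positive": place_evidence(evidence, fillers, position),
                "position": position,
            }
        )
    rng.shuffle(rows)  # avoid a fixed position order within training batches
    return rows
```

Calling `build_training_set(examples, "end")` yields a set skewed toward late evidence, while `build_training_set(examples, "balanced")` spreads the same examples evenly over the three positions.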

📝 Abstract

Dense retrievers exhibit positional bias, favoring documents whose query-relevant information appears near the beginning and degrading retrieval performance when the information appears later. While prior work on positional bias in dense retrievers has largely focused on architectural explanations, we study how the positional distribution of evidence in training data affects retrieval-level bias direction. To test this, we construct synthetic position-targeted training sets in which query-relevant evidence appears at the beginning, middle, or end of documents, and fine-tune eight architecturally diverse pretrained models under position-skewed and balanced training distributions. At the ranking level, we observe a strong directional pattern across the examined models: skewed training distributions favor evidence at the corresponding positions. Position-balanced training reduces positional sensitivity by 57–87% on position-aware benchmarks, with competitive mean retrieval performance in our controlled setting. Representation-level analyses further suggest that fine-tuning often reshapes learned positional preferences, although pre-existing architectural or pretraining-specific tendencies persist in some models. These results identify training-position distribution as a major controllable factor in retrieval-level position bias and suggest balanced data curation as a practical mitigation strategy.
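
To make the "reduces positional sensitivity by 57–87%" figure concrete, here is one way such a relative reduction could be computed. The abstract does not define positional sensitivity, so the max-minus-min spread used below is an assumption, and the scores are made-up illustrative numbers, not results from the paper.

```python
def positional_sensitivity(scores_by_position):
    """Spread between the best and worst per-position retrieval score.

    Assumed definition for illustration; the paper may use a different one.
    """
    values = list(scores_by_position.values())
    return max(values) - min(values)


def relative_reduction(baseline, mitigated):
    """Fraction by which sensitivity drops relative to the baseline."""
    if baseline <= 0:
        raise ValueError("baseline sensitivity must be positive")
    return 1.0 - mitigated / baseline


# Hypothetical retrieval scores (in percent) for evidence at each position.
skewed = {"beginning": 80, "middle": 60, "end": 50}
balanced = {"beginning": 72, "middle": 70, "end": 66}

reduction = relative_reduction(
    positional_sensitivity(skewed), positional_sensitivity(balanced)
)
print(f"relative reduction in sensitivity: {reduction:.0%}")
```

With these invented numbers the spread shrinks from 30 points to 6 points, a relative reduction of 80%, which is the kind of quantity the reported 57–87% range would describe under this assumed definition.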

Problem

Research questions and friction points this paper is trying to address.

position bias

dense retrievers

training data distribution

retrieval performance

evidence position

Innovation

Methods, ideas, or system contributions that make the work stand out.

positional bias

dense retrieval

training data distribution