Positional Bias in Long-Document Ranking: Impact, Assessment, and Mitigation

📅 2022-07-04
📈 Citations: 9
Influential: 1
🤖 AI Summary
This work identifies a pervasive positional bias in long-document ranking benchmarks: relevant passages in mainstream collections (e.g., MS MARCO, BEIR) are overwhelmingly concentrated near document beginnings, allowing long-context models (e.g., RankGPT, LongP) to overfit early positions and fail when relevance lies deeper in the document. First, the authors empirically verify this bias across multiple corpora, including six BEIR collections typically categorized as short-document datasets. Second, they introduce MS MARCO FarRelevant, a diagnostic benchmark in which relevant spans are deliberately placed beyond the first 512 tokens. Third, they evaluate over 20 state-of-the-art models: on standard benchmarks, none outperforms a simple FirstP truncation baseline by more than 5% on average, while on FarRelevant many long-context models collapse to random-chance accuracy; certain architectural designs prove more robust to positional bias. Attempts to debias the training data yield only limited gains. The study calls for more careful benchmark design to enable fairer, more reliable evaluation of long-document ranking models.
📝 Abstract
We tested over 20 Transformer models for ranking long documents (including recent LongP models trained with FlashAttention and RankGPT models "powered" by OpenAI and Anthropic cloud APIs). We compared them with the simple FirstP baseline, which applied the same model to truncated input (up to 512 tokens). On MS MARCO, TREC DL, and Robust04 no long-document model outperformed FirstP by more than 5% (on average). We hypothesized that this lack of improvement is not due to inherent model limitations, but due to benchmark positional bias (most relevant passages tend to occur early in documents), which is known to exist in MS MARCO. To confirm this, we analyzed positional relevance distributions across four long-document corpora (with six query sets) and observed the same early-position bias. Surprisingly, we also found bias in six BEIR collections, which are typically categorized as short-document datasets. We then introduced a new diagnostic dataset, MS MARCO FarRelevant, where relevant spans were deliberately placed beyond the first 512 tokens. On this dataset, many long-context models (including RankGPT) performed at random-baseline level, suggesting overfitting to positional bias. We also experimented with debiasing training data, but with limited success. Our findings (1) highlight the need for careful benchmark design in evaluating long-context models for document ranking, (2) identify model types that are more robust to positional bias, and (3) motivate further work on approaches to debias training data. We release our code and data to support further research.
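The FirstP baseline and the FarRelevant construction described above can be illustrated with a minimal sketch (this is not the authors' released code; whitespace splitting stands in for a real tokenizer, and `make_far_relevant` is a hypothetical helper):

```python
# Hedged sketch: why a FirstP-style truncation baseline is blind to
# relevance placed beyond its window, mimicking the idea behind
# MS MARCO FarRelevant. Token counts use whitespace splitting as a
# stand-in for a real subword tokenizer.

FIRSTP_WINDOW = 512  # tokens kept by the FirstP truncation baseline


def first_p(tokens, window=FIRSTP_WINDOW):
    """Truncate a document to its first `window` tokens (FirstP input)."""
    return tokens[:window]


def make_far_relevant(relevant_passage, filler_token="filler", offset=FIRSTP_WINDOW):
    """Place the relevant passage after `offset` filler tokens, so it
    falls entirely outside the FirstP window (hypothetical helper)."""
    return [filler_token] * offset + relevant_passage.split()


relevant = "the capital of France is Paris"
doc = make_far_relevant(relevant)

truncated = first_p(doc)
print("Paris" in doc)        # True  -- the full document contains the answer
print("Paris" in truncated)  # False -- FirstP never sees it
```

A model that only ever reads the FirstP window can still score well on benchmarks where relevance clusters early, which is exactly the overfitting the paper diagnoses.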
Problem

Research questions and friction points this paper is trying to address.

Addressing positional bias in long-document ranking benchmarks
Evaluating Transformer models' robustness to early-position relevance bias
Developing debiasing methods for long-context document ranking models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using diagnostic dataset MS MARCO FarRelevant
Analyzing positional relevance across multiple corpora
Experimenting with debiasing training data techniques
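The positional relevance analysis listed above amounts to measuring where relevant spans start within documents. A minimal sketch of such a measurement (hypothetical `corpus`/`qrels` shapes, not the paper's actual pipeline):

```python
from collections import Counter

# Hedged sketch: estimate the positional distribution of relevant spans.
# `corpus` maps doc_id -> token list; `qrels` lists pairs of
# (doc_id, start_token_of_relevant_span). Both shapes are assumptions
# for illustration, not the paper's data format.


def position_bucket(start, doc_len, n_buckets=10):
    """Map a relevant span's start offset to a relative-position decile."""
    return min(int(n_buckets * start / max(doc_len, 1)), n_buckets - 1)


def positional_distribution(corpus, qrels, n_buckets=10):
    """Fraction of relevant spans falling into each relative-position bucket."""
    counts = Counter(
        position_bucket(start, len(corpus[doc_id]), n_buckets)
        for doc_id, start in qrels
    )
    total = sum(counts.values())
    return [counts.get(b, 0) / total for b in range(n_buckets)]


# Toy example: all relevant spans sit near document starts, so the
# probability mass concentrates in the first bucket (early-position bias).
corpus = {"d1": ["tok"] * 1000, "d2": ["tok"] * 800}
qrels = [("d1", 10), ("d2", 30)]
print(positional_distribution(corpus, qrels))
```

A heavily skewed first bucket, as in this toy case, is the signature of the early-position bias the paper reports across MS MARCO and several BEIR collections.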
👥 Authors

Leonid Boytsov
Independent Researcher, Pittsburgh, USA

Tianyi Lin
Columbia University
Game Theory · Machine Learning · Optimization · Optimal Transport

Fangwei Gao
Carnegie Mellon University

Yutian Zhao
Carnegie Mellon University

Jeffrey Huang
Carnegie Mellon University

Eric Nyberg
Professor of Computer Science, Carnegie Mellon University
Artificial Intelligence · Computational Linguistics · Machine Translation · Question Answering