🤖 AI Summary
Dense retrieval models exhibit several shallow biases, including a preference for shorter documents, first-paragraph bias, literal matching, and over-reliance on repeated entities. These biases cause retrievers to largely ignore whether a document actually answers the query; when multiple biases combine, retrieval fails catastrophically, and the retrieved documents significantly mislead LLMs in RAG.
Method: We introduce the first controllable bias-evaluation framework, built on relation extraction datasets (e.g., Re-DocRED), enabling quantitative measurement of individual and compound bias effects; we propose a robustness-oriented evaluation paradigm; and we systematically analyze state-of-the-art retrievers (e.g., Dragon+, Contriever).
Contribution/Results: Experiments reveal that answer-containing documents are preferred over biased distractors in fewer than 3% of cases, and that RAG performance drops by 34% relative to a no-retrieval baseline, demonstrating that biased retrieval is worse than no retrieval at all. Our work establishes theoretical foundations and practical guidelines for building factually reliable, robust, and trustworthy retrieval systems.
📝 Abstract
Dense retrieval models are commonly used in Information Retrieval (IR) applications, such as Retrieval-Augmented Generation (RAG). Since they often serve as the first step in these systems, their robustness is critical to avoid failures. In this work, by repurposing a relation extraction dataset (e.g., Re-DocRED), we design controlled experiments to quantify the impact of heuristic biases, such as favoring shorter documents, in retrievers like Dragon+ and Contriever. Our findings reveal significant vulnerabilities: retrievers often rely on superficial patterns, over-prioritizing document beginnings, shorter documents, repeated entities, and literal matches. Additionally, they tend to overlook whether the document contains the query's answer, lacking deep semantic understanding. Notably, when multiple biases combine, models exhibit catastrophic performance degradation, selecting the answer-containing document in less than 3% of cases over a biased document without the answer. Furthermore, we show that these biases have direct consequences for downstream applications like RAG, where retrieval-preferred documents can mislead LLMs, resulting in a 34% performance drop compared to not providing any documents at all.