🤖 AI Summary
Traditional IR metrics (e.g., nDCG, MRR) assume sequential, position-discounted browsing by human users and distinguish only between relevant and irrelevant documents. These assumptions break down when evaluating retrieval for RAG, where LLMs process retrieved documents holistically rather than sequentially, and where "pseudo-relevant" documents (topically related but harmful to generation) exert negative interference.
Method: We propose a utility-aware evaluation paradigm: (i) a dual-dimensional annotation scheme capturing both document utility and interference, and (ii) an LLM-adapted position-decay function, yielding a new metric, UDCG (Utility and Distraction-aware Cumulative Gain).
Contribution/Results: UDCG explicitly models both positive utility and negative interference of retrieved documents while relaxing strict positional sensitivity—aligning with LLM behavior. Experiments across five datasets and six LLMs show UDCG achieves up to 36% higher correlation with end-to-end answer accuracy than nDCG or MRR, significantly improving reliability and predictive power in RAG retrieval evaluation.
📝 Abstract
Traditional Information Retrieval (IR) metrics, such as nDCG, MAP, and MRR, assume that human users sequentially examine documents with diminishing attention to lower ranks. This assumption breaks down in Retrieval Augmented Generation (RAG) systems, where search results are consumed by Large Language Models (LLMs), which, unlike humans, process all retrieved documents as a whole rather than sequentially. Additionally, traditional IR metrics do not account for related but irrelevant documents that actively degrade generation quality, rather than merely being ignored. Due to these two major misalignments, namely human vs. machine position discount and human relevance vs. machine utility, classical IR metrics do not accurately predict RAG performance. We introduce a utility-based annotation schema that quantifies both the positive contribution of relevant passages and the negative impact of distracting ones. Building on this foundation, we propose UDCG (Utility and Distraction-aware Cumulative Gain), a metric that uses an LLM-oriented positional discount to directly optimize correlation with end-to-end answer accuracy. Experiments on five datasets and six LLMs demonstrate that UDCG improves correlation by up to 36% compared to traditional metrics. Our work provides a critical step toward aligning IR evaluation with LLM consumers and enables more reliable assessment of RAG components.
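The core contrast described above (signed utilities plus a gentle, LLM-oriented positional discount, versus nDCG's non-negative relevance and steep log discount) can be sketched in a few lines. The paper does not give the exact UDCG formula here, so the geometric `alpha` decay, the utility values, and the function names below are illustrative assumptions, not the authors' definition:

```python
import math

def dcg(relevances):
    # Classic DCG: non-negative (graded) relevance labels and a steep
    # 1/log2(rank+1) discount modeling a human scanning top-down.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def udcg_sketch(utilities, alpha=0.9):
    # Hypothetical UDCG-style gain: utilities are signed (negative values
    # model distracting pseudo-relevant passages), and a flat geometric
    # discount alpha**i (alpha near 1) approximates an LLM that attends to
    # all retrieved documents nearly equally rather than sequentially.
    return sum(u * alpha ** i for i, u in enumerate(utilities))

# Same two useful passages in both rankings, but ranking_b also retrieves
# a distracting pseudo-relevant passage (utility -0.5) near the top.
ranking_a = [1.0, 0.8, 0.0]
ranking_b = [1.0, -0.5, 0.8]

# The distractor lowers the UDCG-style score, whereas a binary-relevance
# DCG would treat it the same as any other non-relevant document.
print(udcg_sketch(ranking_a), udcg_sketch(ranking_b))
```

Under binary relevance, both rankings contain the same relevant documents, so a relevance-only metric cannot penalize `ranking_b`; a signed-utility cumulative gain does, which is the behavior the paper argues correlates better with end-to-end answer accuracy.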