🤖 AI Summary
Traditional IR metrics (e.g., nDCG, MRR) assume sequential, position-discounted browsing by human users and distinguish only between relevant and irrelevant documents. These assumptions break down when evaluating retrieval for RAG, where LLMs process retrieved documents holistically rather than sequentially, and where "pseudo-relevant" documents (topically related but harmful to generation) exert negative interference.
Method: We propose a utility-aware evaluation paradigm: (i) a dual-dimensional annotation scheme capturing both document utility and interference, and (ii) an LLM-adapted position-decay function, yielding a new metric, UDCG (Utility and Distraction-aware Cumulative Gain).
Contribution/Results: UDCG explicitly models both positive utility and negative interference of retrieved documents while relaxing strict positional sensitivity—aligning with LLM behavior. Experiments across five datasets and six LLMs show UDCG achieves up to 36% higher correlation with end-to-end answer accuracy than nDCG or MRR, significantly improving reliability and predictive power in RAG retrieval evaluation.
📝 Abstract
Traditional Information Retrieval (IR) metrics, such as nDCG, MAP, and MRR, assume that human users sequentially examine documents with diminishing attention to lower ranks. This assumption breaks down in Retrieval Augmented Generation (RAG) systems, where search results are consumed by Large Language Models (LLMs), which, unlike humans, process all retrieved documents as a whole rather than sequentially. Additionally, traditional IR metrics do not account for related but irrelevant documents that actively degrade generation quality, rather than merely being ignored. Due to these two major misalignments, namely human vs. machine position discount and human relevance vs. machine utility, classical IR metrics do not accurately predict RAG performance. We introduce a utility-based annotation schema that quantifies both the positive contribution of relevant passages and the negative impact of distracting ones. Building on this foundation, we propose UDCG (Utility and Distraction-aware Cumulative Gain), a metric that uses an LLM-oriented positional discount to directly optimize correlation with end-to-end answer accuracy. Experiments on five datasets and six LLMs demonstrate that UDCG improves correlation by up to 36% compared to traditional metrics. Our work provides a critical step toward aligning IR evaluation with LLM consumers and enables more reliable assessment of RAG components.
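The core contrast described above (signed utilities plus a gentle, LLM-oriented positional discount, versus nDCG's non-negative relevance and steep log discount) can be sketched in a few lines. The paper does not give the exact UDCG formula here, so the geometric `alpha` decay, the utility values, and the function names below are illustrative assumptions, not the authors' definition:

```python
import math

def dcg(relevances):
    # Classic DCG: non-negative (graded) relevance labels and a steep
    # 1/log2(rank+1) discount modeling a human scanning top-down.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def udcg_sketch(utilities, alpha=0.9):
    # Hypothetical UDCG-style gain: utilities are signed (negative values
    # model distracting pseudo-relevant passages), and a flat geometric
    # discount alpha**i (alpha near 1) approximates an LLM that attends to
    # all retrieved documents nearly equally rather than sequentially.
    return sum(u * alpha ** i for i, u in enumerate(utilities))

# Same two useful passages in both rankings, but ranking_b also retrieves
# a distracting pseudo-relevant passage (utility -0.5) near the top.
ranking_a = [1.0, 0.8, 0.0]
ranking_b = [1.0, -0.5, 0.8]

# The distractor lowers the UDCG-style score, whereas a binary-relevance
# DCG would treat it the same as any other non-relevant document.
print(udcg_sketch(ranking_a), udcg_sketch(ranking_b))
```

Under binary relevance, both rankings contain the same relevant documents, so a relevance-only metric cannot penalize `ranking_b`; a signed-utility cumulative gain does, which is the behavior the paper argues correlates better with end-to-end answer accuracy.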