InfoGain-RAG: Boosting Retrieval-Augmented Generation via Document Information Gain-based Reranking and Filtering

📅 2025-09-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing RAG systems lack quantitative assessment of individual documents’ actual contribution to answer generation, rendering them vulnerable to irrelevant or misleading retrieved content. To address this, we propose Document Information Gain (DIG), the first metric that quantifies a document’s information contribution based on confidence-score differences induced by its inclusion/exclusion in LLM-based answer generation. Building upon DIG, we introduce InfoGain-RAG—a framework that trains a dedicated re-ranker via supervised learning to perform precise re-ranking and dynamic filtering of retrieval results, compatible with both single- and multi-hop retrieval paradigms. Evaluated on benchmarks including NaturalQA, InfoGain-RAG significantly improves robustness and accuracy: exact match accuracy increases by up to 17.9%, with an average gain of 15.3% on GPT-4o—outperforming state-of-the-art RAG baselines.

📝 Abstract
Retrieval-Augmented Generation (RAG) has emerged as a promising approach to address key limitations of Large Language Models (LLMs), such as hallucination, outdated knowledge, and lack of references. However, current RAG frameworks often struggle to identify whether retrieved documents meaningfully contribute to answer generation. This shortcoming makes it difficult to filter out irrelevant or even misleading content, which notably impacts final performance. In this paper, we propose Document Information Gain (DIG), a novel metric designed to quantify the contribution of retrieved documents to correct answer generation. DIG measures a document's value by computing the difference in the LLM's generation confidence with and without the document in the context. Building on DIG, we introduce InfoGain-RAG, a framework that leverages DIG scores to train a specialized reranker, which both distinguishes useful documents from irrelevant ones and sorts them accurately. This approach effectively filters out irrelevant documents and selects the most valuable ones for better answer generation. Extensive experiments across various models and benchmarks demonstrate that InfoGain-RAG significantly outperforms existing approaches under both single- and multiple-retriever paradigms. On NaturalQA specifically, it improves exact match accuracy by 17.9%, 4.5%, and 12.5% over naive RAG, self-reflective RAG, and modern ranking-based RAG respectively, and achieves an average gain of 15.3% on the advanced proprietary model GPT-4o across all datasets. These results demonstrate that InfoGain-RAG offers a reliable solution for RAG across multiple applications.
Problem

Research questions and friction points this paper is trying to address.

Quantifying document contribution to correct answer generation
Filtering irrelevant documents in retrieval-augmented generation systems
Improving RAG performance through information gain-based reranking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Document Information Gain metric quantifies document contribution
InfoGain-RAG framework trains specialized reranker using DIG
Filters irrelevant documents and selects most valuable ones
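The DIG scoring and filtering idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `answer_confidence` stands in for an LLM call that returns the model's confidence in the gold answer (e.g., mean token log-probability), and `toy_confidence` is a made-up substitute used only to make the sketch runnable.

```python
from typing import Callable, List, Tuple

def dig_score(query: str, answer: str, doc: str,
              answer_confidence: Callable[[str, str], float]) -> float:
    """DIG(d) = confidence in the answer with document d in the context,
    minus confidence without it."""
    with_doc = answer_confidence(f"{doc}\n\n{query}", answer)
    without_doc = answer_confidence(query, answer)
    return with_doc - without_doc

def rerank_and_filter(query: str, answer: str, docs: List[str],
                      answer_confidence: Callable[[str, str], float],
                      threshold: float = 0.0) -> List[Tuple[str, float]]:
    """Keep only documents with information gain above the threshold,
    sorted best first."""
    scored = [(d, dig_score(query, answer, d, answer_confidence)) for d in docs]
    kept = [(d, s) for d, s in scored if s > threshold]
    return sorted(kept, key=lambda pair: -pair[1])

# Hypothetical stand-in for an LLM confidence call: higher score when the
# context mentions the gold answer.
def toy_confidence(context: str, answer: str) -> float:
    return 0.9 if answer.lower() in context.lower() else 0.1

docs = ["Paris is the capital of France.", "Bananas are yellow."]
ranked = rerank_and_filter("What is the capital of France?", "Paris",
                           docs, toy_confidence)
# The France document gets a positive DIG and ranks first; the irrelevant
# banana document has zero gain and is filtered out.
print(ranked)
```

In the paper itself this signal is used to train a dedicated reranker via supervised learning, so the LLM confidence calls are only needed at training-data construction time, not at inference.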
Zihan Wang
Kuaishou Technology, Beijing, China
Zihan Liang
Kuaishou Technology, Beijing, China
Zhou Shao
Peking University, Beijing, China
Yufei Ma
Peking University
Neural Network Accelerator · Computing-in-Memory · FPGA Design · Neuromorphic Computing
Huangyu Dai
Kuaishou Technology, Beijing, China
Ben Chen
KuaiShou, Alibaba, HUST, WHU
Multimodal · LLM · Generative Recommendation · Semantic Matching
Lingtao Mao
Kuaishou Technology, Beijing, China
Chenyi Lei
Kuaishou Technology
Recommender System · Information Retrieval · Generative Recommendation · Multimodal
Yuqing Ding
Kuaishou Technology, Beijing, China
Han Li
Kuaishou Technology, Beijing, China