Influence Guided Context Selection for Effective Retrieval-Augmented Generation

📅 2025-09-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
RAG systems often suffer from hallucinations due to low-quality retrieved contexts—e.g., irrelevant or noisy passages. Existing context filtering methods, which rely on predefined metrics, yield limited improvements because they fail to jointly model the dynamic interplay among the query, the candidate context list, and the generator. This work pioneers a *runtime data valuation* perspective for context quality assessment, introducing the Context Influence (CI) value—a metric that quantifies each context’s actual contribution to final generation performance. We propose a hierarchical proxy model that jointly incorporates query-context relevance, inter-context interactions, and generator feedback, enabling parameter-free, online context selection. Evaluated across eight NLP tasks and multiple large language models, our approach significantly outperforms state-of-the-art methods, effectively suppressing interference from low-quality contexts while preserving critical information—thereby enhancing both the robustness and accuracy of RAG systems.

📝 Abstract
Retrieval-Augmented Generation (RAG) addresses large language model (LLM) hallucinations by grounding responses in external knowledge, but its effectiveness is compromised by poor-quality retrieved contexts containing irrelevant or noisy information. While existing approaches attempt to improve performance through context selection based on predefined context quality assessment metrics, they show limited gains over standard RAG. We attribute this limitation to their failure in holistically utilizing available information (query, context list, and generator) for comprehensive quality assessment. Inspired by recent advances in data selection, we reconceptualize context quality assessment as an inference-time data valuation problem and introduce the Contextual Influence Value (CI value). This novel metric quantifies context quality by measuring the performance degradation when removing each context from the list, effectively integrating query-aware relevance, list-aware uniqueness, and generator-aware alignment. Moreover, CI value eliminates complex selection hyperparameter tuning by simply retaining contexts with positive CI values. To address practical challenges of label dependency and computational overhead, we develop a parameterized surrogate model for CI value prediction during inference. The model employs a hierarchical architecture that captures both local query-context relevance and global inter-context interactions, trained through oracle CI value supervision and end-to-end generator feedback. Extensive experiments across 8 NLP tasks and multiple LLMs demonstrate that our context selection method significantly outperforms state-of-the-art baselines, effectively filtering poor-quality contexts while preserving critical information. Code is available at https://github.com/SJTU-DMTai/RAG-CSM.
Problem

Research questions and friction points this paper is trying to address.

Improving RAG by filtering irrelevant contexts using influence values
Addressing poor retrieval quality that causes LLM hallucinations
Developing a parameterized model to assess context quality holistically
Innovation

Methods, ideas, or system contributions that make the work stand out.

CI value measures context quality via performance degradation
Hierarchical model captures local relevance and global interactions
Retains contexts with positive CI values automatically
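The selection rule above can be illustrated with a minimal sketch of the oracle CI computation described in the abstract: each context's CI value is the performance drop observed when that context is removed from the list, and only contexts with positive CI are kept. The `score(query, contexts)` callable here is a hypothetical stand-in for the generator's performance metric (the paper instead trains a surrogate model to predict CI values at inference time, since true labels are unavailable then).

```python
def oracle_ci_values(query, contexts, score):
    """Oracle Contextual Influence (CI) values: for each context, the
    performance degradation when it is removed from the full list."""
    full = score(query, contexts)
    ci = []
    for i in range(len(contexts)):
        ablated = contexts[:i] + contexts[i + 1:]  # leave-one-out ablation
        ci.append(full - score(query, ablated))
    return ci


def select_contexts(query, contexts, score):
    """Retain contexts with positive CI value (no threshold tuning)."""
    ci = oracle_ci_values(query, contexts, score)
    return [c for c, v in zip(contexts, ci) if v > 0]
```

With a toy scoring function that rewards relevant passages and penalizes noisy ones, a noisy context receives a negative CI value and is filtered out, while the relevant ones are preserved.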
Jiale Deng
Shanghai Jiao Tong University

Yanyan Shen
Shanghai Jiao Tong University
Data management · Data analytics · Machine learning systems

Ziyuan Pei
Shanghai Jiao Tong University

Youmin Chen
Shanghai Jiao Tong University

Linpeng Huang
Shanghai Jiao Tong University