Investigating Context-Faithfulness in Large Language Models: The Roles of Memory Strength and Evidence Style

📅 2024-09-17
🏛️ arXiv.org
📈 Citations: 2
Influential: 0

🤖 AI Summary
This study addresses context-faithfulness in retrieval-augmented generation (RAG): the extent to which large language models (LLMs) ground their responses in retrieved external evidence rather than in internal parametric knowledge. The authors propose a method, which they describe as not considered in prior work, to quantify a model's memory strength from the divergence of its responses across paraphrases of the same question, enabling fine-grained measurement of evidence reliance. They also systematically evaluate how evidence presentation styles, such as paraphrasing and detail expansion, affect evidence adoption rates. The empirical analysis uses Natural Questions and PopQA and combines controlled retrieval, prompt engineering, and diversity-aware metrics. Results show that questions with high memory strength make models less likely to adopt the evidence, especially larger models such as GPT-4, while paraphrased evidence improves context-faithfulness by up to 23.6% and outperforms both verbatim repetition and detail-enriched variants. The work offers a quantifiable framework for analyzing RAG trustworthiness and practical strategies for improving evidence grounding.

📝 Abstract
Retrieval-augmented generation (RAG) improves Large Language Models (LLMs) by incorporating external information into the response generation process. However, how context-faithful LLMs are, and which factors influence their context-faithfulness, remain largely unexplored. In this study, we investigate the impact of memory strength and evidence presentation on LLMs' receptiveness to external evidence. We introduce a method to quantify the memory strength of LLMs by measuring the divergence in their responses to different paraphrases of the same question, a factor that previous work has not considered. We also generate evidence in various styles to evaluate how presentation style affects receptiveness. Two datasets are used for evaluation: Natural Questions (NQ), which contains popular questions, and PopQA, which features long-tail questions. Our results show that for questions with high memory strength, LLMs are more likely to rely on internal memory, particularly larger LLMs such as GPT-4. On the other hand, presenting paraphrased evidence significantly increases LLMs' receptiveness compared to simple repetition or adding details.
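The abstract describes memory strength as a function of how much a model's answers diverge across paraphrases of one question, but this page does not give the exact formula. The sketch below is one plausible reading, not the authors' implementation: it assumes the divergence is measured as the entropy of the answer distribution and reports its negative, so that a higher score means a stronger memory. The names `normalize` and `memory_strength` are hypothetical, and grouping answers by normalized exact match is a simplification; free-form answers would likely need semantic clustering.

```python
import math
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase, trim, drop a trailing period, and collapse whitespace."""
    return " ".join(answer.lower().strip().rstrip(".").split())


def memory_strength(answers: list[str]) -> float:
    """Negative entropy (in nats) of the answers given to paraphrased questions.

    ASSUMPTION: entropy is used here as the divergence measure; the source
    page does not state the paper's exact formula.
    Returns 0.0 when every answer agrees, and a more negative value as the
    answers scatter across more distinct groups.
    """
    if not answers:
        raise ValueError("at least one answer is required")
    counts = Counter(normalize(a) for a in answers)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return -entropy


# Answers from the same model to several paraphrases of one question.
consistent = memory_strength(["Paris", "paris", "Paris."])
scattered = memory_strength(["Paris", "Lyon"])
```

Under this reading, `consistent` scores higher than `scattered`, which matches the abstract's intuition that a question the model answers the same way every time is one it remembers strongly.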
Problem

Research questions and friction points this paper is trying to address.

Examining how memory strength affects LLMs' context-faithfulness.
Assessing the impact of evidence style on LLM receptiveness.
Quantifying memory strength via response divergence across question paraphrases.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Quantify memory strength via response divergence
Generate evidence in varied presentation styles
Enhance receptiveness with paraphrased evidence
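To make "receptiveness" concrete, the sketch below computes an evidence adoption rate: among questions where the provided evidence conflicts with the model's closed-book answer, the fraction where the model's final answer follows the evidence. This is an illustrative assumption, since the paper's exact metric is not given on this page; `Trial` and `adoption_rate` are hypothetical names, and answers are compared by exact string match for simplicity.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    memory_answer: str    # closed-book answer, given without any evidence
    evidence_answer: str  # answer supported by the evidence shown to the model
    final_answer: str     # answer given with the evidence in context


def adoption_rate(trials: list[Trial]) -> float:
    """Share of conflicting trials in which the model adopts the evidence answer.

    ASSUMPTION: only trials where memory and evidence disagree are counted,
    because agreement cannot show which source the model relied on.
    """
    eligible = [t for t in trials if t.memory_answer != t.evidence_answer]
    if not eligible:
        return 0.0
    adopted = sum(t.final_answer == t.evidence_answer for t in eligible)
    return adopted / len(eligible)


# One rate per evidence style lets styles be compared on the same questions.
paraphrased = [
    Trial("1912", "1911", "1911"),
    Trial("Rome", "Milan", "Milan"),
]
repeated = [
    Trial("1912", "1911", "1912"),
    Trial("Rome", "Milan", "Milan"),
]
rates = {
    "paraphrased": adoption_rate(paraphrased),
    "repeated": adoption_rate(repeated),
}
```

The toy trials here are made up purely to exercise the function and are not results from the paper.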
Authors

Yuepei Li, Kang Zhou, Qiao Qiao, Bach Nguyen, Qing Wang, Qi Li
Department of Computer Science, Iowa State University, Ames, Iowa, USA