🤖 AI Summary
This work addresses the problem of context faithfulness: the tendency of large language models (LLMs) and retrieval-augmented generation (RAG) systems to deviate from the provided context in their responses. To this end, we introduce FaithEval, a benchmark explicitly designed to evaluate context faithfulness. FaithEval comprises 4.9K high-quality samples, automatically constructed with LLMs and verified by humans, covering three realistic retrieval-failure scenarios: unanswerable, inconsistent, and counterfactual contexts, which simulate incomplete, contradictory, and fabricated retrieved information. A systematic evaluation across open-source and proprietary models finds no positive correlation between model scale and faithfulness, challenging the assumption that bigger models are more reliable; even state-of-the-art LLMs and RAG systems achieve faithful accuracy below 50% on key tasks. FaithEval thus provides a reproducible, diagnostic, and principled way to assess trustworthy generation.
📝 Abstract
Ensuring faithfulness to context in large language models (LLMs) and retrieval-augmented generation (RAG) systems is crucial for reliable deployment in real-world applications, since incorrect or unsupported information can erode user trust. Despite advances on standard benchmarks, faithfulness hallucination, where models generate responses misaligned with the provided context, remains a significant challenge. In this work, we introduce FaithEval, a novel and comprehensive benchmark tailored to evaluate the faithfulness of LLMs in contextual scenarios across three diverse tasks: unanswerable, inconsistent, and counterfactual contexts. These tasks simulate real-world challenges where retrieval mechanisms may surface incomplete, contradictory, or fabricated information. FaithEval comprises 4.9K high-quality problems in total, validated through a rigorous four-stage context construction and validation framework that employs both LLM-based auto-evaluation and human validation. Our extensive study across a wide range of open-source and proprietary models reveals that even state-of-the-art models often struggle to remain faithful to the given context, and that larger models do not necessarily exhibit improved faithfulness. The project is available at: https://github.com/SalesforceAIResearch/FaithEval.
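To make the evaluation setting concrete, the snippet below sketches how a context-faithfulness check on a counterfactual sample might look. The sample schema and the substring-based `judge_faithful` helper are illustrative assumptions for exposition only, not FaithEval's actual data format or judging procedure.

```python
# Minimal sketch of a context-faithfulness check on a counterfactual sample.
# The sample fields and the matching rule are illustrative assumptions,
# not the benchmark's real schema or judge.

def judge_faithful(model_answer: str, context_answer: str) -> bool:
    """A response counts as 'faithful' if it matches the answer supported
    by the provided context, even when that context is counterfactual."""
    return context_answer.lower() in model_answer.lower()

# A counterfactual sample: the context deliberately contradicts world knowledge.
sample = {
    "context": "In this fictional almanac, water boils at 50 degrees Celsius at sea level.",
    "question": "At what temperature does water boil at sea level?",
    "context_answer": "50 degrees Celsius",  # answer supported by the context
    "world_answer": "100 degrees Celsius",   # answer from parametric knowledge
}

# A faithful model follows the context; an unfaithful one falls back on
# memorized world knowledge.
faithful_reply = "According to the passage, water boils at 50 degrees Celsius."
unfaithful_reply = "Water boils at 100 degrees Celsius at sea level."

print(judge_faithful(faithful_reply, sample["context_answer"]))    # True
print(judge_faithful(unfaithful_reply, sample["context_answer"]))  # False
```

In practice, a benchmark like this would replace the naive substring match with a more robust judge (e.g., multiple-choice answer extraction or an LLM-based grader), since free-form answers can express the same content in many surface forms.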