🤖 AI Summary
This review addresses the challenge of evaluating faithfulness (factual consistency between generated text and source material) in large language model (LLM) outputs. It surveys faithfulness metrics across open-ended summarization, question answering, and machine translation tasks, finding that the LLM-as-a-judge paradigm is typically the metric most strongly correlated with human judgement. The review also examines how prior work has mitigated hallucinations, with retrieval-augmented generation (RAG) and structured prompting frameworks both linked to improved faithfulness, and offers further practical recommendations for enhancing factual consistency in LLM-generated content.
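The LLM-as-a-judge paradigm mentioned above can be sketched as a prompt-and-parse loop: the source text and the generated text are placed into a judging prompt, and a numeric faithfulness score is extracted from the judge's reply. The template, 1-5 scale, and function names below are illustrative assumptions, not the exact protocol of any surveyed study; a real system would replace the stubbed reply with a call to an LLM API.

```python
import re

def build_judge_prompt(source: str, output: str) -> str:
    """Assemble a prompt asking a judge LLM for a 1-5 faithfulness score."""
    return (
        "You are a strict factual-consistency judge.\n"
        f"Source:\n{source}\n\n"
        f"Generated text:\n{output}\n\n"
        "Rate faithfulness on a 1-5 scale (5 = fully supported by the source).\n"
        "Answer with: Score: <number>"
    )

def parse_score(judge_reply: str) -> int:
    """Extract the numeric score from the judge's free-text reply."""
    match = re.search(r"Score:\s*([1-5])", judge_reply)
    if match is None:
        raise ValueError("judge reply contained no score")
    return int(match.group(1))

# Usage with a stubbed judge reply (a real pipeline calls an LLM here):
prompt = build_judge_prompt("The Eiffel Tower is in Paris.",
                            "The tower stands in Paris, France.")
reply = "Score: 5"  # stand-in for the judge model's response
print(parse_score(reply))  # → 5
```

Scores collected this way can then be correlated with human ratings (e.g. via Pearson or Spearman coefficients) to validate the judge, which is how metric-human alignment is typically assessed.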
📝 Abstract
This review examines the means by which faithfulness has been evaluated across open-ended summarization, question-answering, and machine translation tasks. We find that using an LLM as a faithfulness evaluator is commonly the metric most highly correlated with human judgement. The means by which other studies have mitigated hallucinations are also discussed: both retrieval-augmented generation (RAG) and prompting-framework approaches have been linked with superior faithfulness, and further recommendations for mitigation are provided. Research into faithfulness is integral to the continued widespread use of LLMs, as unfaithful responses pose major risks in many areas where LLMs would otherwise be suitable. Furthermore, evaluating open-ended generation provides a more comprehensive measure of LLM performance than commonly used multiple-choice benchmarking, which can help advance the trust that can be placed in LLMs.
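The RAG-based mitigation strategy discussed above can be sketched as two steps: retrieve passages relevant to the query, then instruct the model to answer only from that retrieved evidence. The word-overlap retriever below is a toy stand-in for a real BM25 or dense retriever, and the prompt wording is an assumption for illustration.

```python
def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank corpus passages by word overlap with the query; keep top k."""
    q = set(query.lower().split())
    ranked = sorted(corpus,
                    key=lambda p: len(q & set(p.lower().split())),
                    reverse=True)
    return ranked[:k]

def grounded_prompt(query: str, passages: list[str]) -> str:
    """Build a prompt that constrains the model to the retrieved evidence."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using ONLY the passages below; reply 'unknown' otherwise.\n"
        f"Passages:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

corpus = [
    "Marie Curie won two Nobel Prizes.",
    "The Amazon river flows through Brazil.",
    "Curie was born in Warsaw.",
]
passages = retrieve("Where was Marie Curie born?", corpus)
print(grounded_prompt("Where was Marie Curie born?", passages))
```

Constraining generation to retrieved evidence in this way is what links RAG to the improved faithfulness the review reports, since unsupported claims can be refused rather than hallucinated.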