🤖 AI Summary
This review addresses the challenge of evaluating faithfulness (factual consistency between generated text and source material) in large language model (LLM) outputs. It surveys faithfulness metrics across open-ended summarization, question answering, and machine translation tasks, finding that the LLM-as-a-judge paradigm is typically the metric most strongly correlated with human judgement. The review also examines how prior work has mitigated hallucinations, with retrieval-augmented generation (RAG) and structured prompting frameworks both linked to improved faithfulness, and offers further practical recommendations for enhancing factual consistency in LLM-generated content.
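The LLM-as-a-judge paradigm mentioned above can be sketched as a prompt-and-parse loop: the source text and the generated text are placed into a judging prompt, and a numeric faithfulness score is extracted from the judge's reply. The template, 1-5 scale, and function names below are illustrative assumptions, not the exact protocol of any surveyed study; a real system would replace the stubbed reply with a call to an LLM API.

```python
import re

def build_judge_prompt(source: str, output: str) -> str:
    """Assemble a prompt asking a judge LLM for a 1-5 faithfulness score."""
    return (
        "You are a strict factual-consistency judge.\n"
        f"Source:\n{source}\n\n"
        f"Generated text:\n{output}\n\n"
        "Rate faithfulness on a 1-5 scale (5 = fully supported by the source).\n"
        "Answer with: Score: <number>"
    )

def parse_score(judge_reply: str) -> int:
    """Extract the numeric score from the judge's free-text reply."""
    match = re.search(r"Score:\s*([1-5])", judge_reply)
    if match is None:
        raise ValueError("judge reply contained no score")
    return int(match.group(1))

# Usage with a stubbed judge reply (a real pipeline calls an LLM here):
prompt = build_judge_prompt("The Eiffel Tower is in Paris.",
                            "The tower stands in Paris, France.")
reply = "Score: 5"  # stand-in for the judge model's response
print(parse_score(reply))  # → 5
```

Scores collected this way can then be correlated with human ratings (e.g. via Pearson or Spearman coefficients) to validate the judge, which is how metric-human alignment is typically assessed.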
📝 Abstract
This review examines the means by which faithfulness has been evaluated across open-ended summarization, question-answering, and machine translation tasks. We find that using an LLM as a faithfulness evaluator is commonly the metric most highly correlated with human judgement. The means by which other studies have mitigated hallucinations are also discussed: both retrieval-augmented generation (RAG) and prompting-framework approaches have been linked with superior faithfulness, and further recommendations for mitigation are provided. Research into faithfulness is integral to the continued widespread use of LLMs, as unfaithful responses pose major risks in many areas where LLMs would otherwise be suitable. Furthermore, evaluating open-ended generation provides a more comprehensive measure of LLM performance than commonly used multiple-choice benchmarking, which can help advance the trust that can be placed in LLMs.
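The RAG-based mitigation strategy discussed above can be sketched as two steps: retrieve passages relevant to the query, then instruct the model to answer only from that retrieved evidence. The word-overlap retriever below is a toy stand-in for a real BM25 or dense retriever, and the prompt wording is an assumption for illustration.

```python
def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank corpus passages by word overlap with the query; keep top k."""
    q = set(query.lower().split())
    ranked = sorted(corpus,
                    key=lambda p: len(q & set(p.lower().split())),
                    reverse=True)
    return ranked[:k]

def grounded_prompt(query: str, passages: list[str]) -> str:
    """Build a prompt that constrains the model to the retrieved evidence."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using ONLY the passages below; reply 'unknown' otherwise.\n"
        f"Passages:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

corpus = [
    "Marie Curie won two Nobel Prizes.",
    "The Amazon river flows through Brazil.",
    "Curie was born in Warsaw.",
]
passages = retrieve("Where was Marie Curie born?", corpus)
print(grounded_prompt("Where was Marie Curie born?", passages))
```

Constraining generation to retrieved evidence in this way is what links RAG to the improved faithfulness the review reports, since unsupported claims can be refused rather than hallucinated.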