Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards

📅 2025-05-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the weak faithfulness evaluation of LLM-generated summaries in retrieval-augmented generation (RAG) settings, where current automated hallucination detectors perform inconsistently. The authors propose FaithJudge, an LLM-as-a-judge evaluation framework guided by few-shot human hallucination annotations, which substantially improves automated hallucination detection over existing methods, including the Hughes Hallucination Evaluation Model (HHEM). They also introduce an enhanced hallucination leaderboard centered on FaithJudge, maintained alongside Vectara's existing leaderboard, enabling more reliable and fine-grained benchmarking of LLM faithfulness in RAG.

📝 Abstract
Hallucinations remain a persistent challenge for LLMs. RAG aims to reduce hallucinations by grounding responses in contexts. However, even when provided context, LLMs still frequently introduce unsupported information or contradictions. This paper presents our efforts to measure LLM hallucinations with a focus on summarization tasks, assessing how often various LLMs introduce hallucinations when summarizing documents. We discuss Vectara's existing LLM hallucination leaderboard, based on the Hughes Hallucination Evaluation Model (HHEM). While HHEM and Vectara's Hallucination Leaderboard have garnered great research interest, we examine challenges faced by HHEM and current hallucination detection methods by analyzing the effectiveness of these methods on existing hallucination datasets. To address these limitations, we propose FaithJudge, an LLM-as-a-judge approach guided by few-shot human hallucination annotations, which substantially improves automated LLM hallucination evaluation over current methods. We introduce an enhanced hallucination leaderboard centered on FaithJudge, alongside our current hallucination leaderboard, enabling more reliable benchmarking of LLMs for hallucinations in RAG.
Problem

Research questions and friction points this paper is trying to address.

Measure LLM hallucinations in summarization tasks
Evaluate limitations of current hallucination detection methods
Propose FaithJudge for improved automated hallucination evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes FaithJudge for hallucination evaluation
Uses few-shot human annotations for guidance
Introduces enhanced hallucination leaderboard
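The core idea behind an LLM-as-a-judge evaluator guided by few-shot human annotations can be sketched roughly as follows. This is an illustrative assumption of how such a pipeline might be wired up, not the paper's actual FaithJudge implementation; the class names, prompt format, and the `call_judge_llm` stub are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class AnnotatedExample:
    """A human-annotated (source, summary) pair used as a few-shot exemplar."""
    source: str
    summary: str
    label: str      # "faithful" or "hallucinated"
    rationale: str  # brief human-written justification

def build_judge_prompt(examples: list[AnnotatedExample],
                       source: str, summary: str) -> str:
    """Assemble a few-shot prompt: instructions, annotated exemplars,
    then the new case for the judge LLM to label."""
    parts = [
        "You are a faithfulness judge. Given a source document and a "
        "summary, answer 'faithful' if every claim in the summary is "
        "supported by the source, otherwise 'hallucinated'."
    ]
    for ex in examples:
        parts.append(
            f"Source: {ex.source}\nSummary: {ex.summary}\n"
            f"Verdict: {ex.label} ({ex.rationale})"
        )
    parts.append(f"Source: {source}\nSummary: {summary}\nVerdict:")
    return "\n\n".join(parts)

def parse_verdict(response: str) -> bool:
    """Return True if the judge's response flags a hallucination."""
    return "hallucinat" in response.lower()

def judge_summary(examples: list[AnnotatedExample],
                  source: str, summary: str, call_judge_llm) -> bool:
    """End-to-end: build the prompt, query the judge model (injected as a
    callable so any LLM backend can be plugged in), parse the verdict."""
    prompt = build_judge_prompt(examples, source, summary)
    return parse_verdict(call_judge_llm(prompt))
```

In this sketch the judge model is passed in as a plain callable, so the few-shot prompting and verdict parsing stay independent of any particular LLM API.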