🤖 AI Summary
This work addresses the unreliable automated detection of hallucinations produced by large language models (LLMs) in summarization tasks, and in particular the weak faithfulness evaluation available in retrieval-augmented generation (RAG) settings. The authors propose FaithJudge, an LLM-as-a-judge evaluation framework guided by few-shot human hallucination annotations. By grounding the judge in human-annotated examples of faithful and hallucinated responses, FaithJudge substantially improves automated hallucination detection over current methods, including the Hughes Hallucination Evaluation Model (HHEM) that underpins Vectara's existing hallucination leaderboard. The authors also introduce an enhanced hallucination leaderboard centered on FaithJudge, published alongside the current one, enabling more reliable benchmarking of LLM faithfulness in RAG.
📝 Abstract
Hallucinations remain a persistent challenge for LLMs. RAG aims to reduce hallucinations by grounding responses in retrieved context. However, even when provided with context, LLMs still frequently introduce unsupported information or contradictions. This paper presents our efforts to measure LLM hallucinations with a focus on summarization tasks, assessing how often various LLMs introduce hallucinations when summarizing documents. We discuss Vectara's existing LLM hallucination leaderboard, based on the Hughes Hallucination Evaluation Model (HHEM). While HHEM and Vectara's Hallucination Leaderboard have garnered considerable research interest, we examine the challenges faced by HHEM and other current hallucination detection methods by analyzing their effectiveness on existing hallucination datasets. To address these limitations, we propose FaithJudge, an LLM-as-a-judge approach guided by few-shot human hallucination annotations, which substantially improves automated LLM hallucination evaluation over current methods. We introduce an enhanced hallucination leaderboard centered on FaithJudge, alongside our current hallucination leaderboard, enabling more reliable benchmarking of LLMs for hallucinations in RAG.
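To make the core idea concrete, here is a minimal sketch of how an LLM-as-a-judge prompt might be assembled from few-shot human hallucination annotations. This is not the paper's implementation; the data fields, prompt format, and function names are all hypothetical assumptions, illustrating only the general technique the abstract describes.

```python
# Hypothetical sketch of few-shot LLM-as-a-judge prompting for hallucination
# detection. All names and the prompt layout are illustrative assumptions,
# not FaithJudge's actual design.
from dataclasses import dataclass


@dataclass
class AnnotatedExample:
    source: str     # the document being summarized
    summary: str    # a model-generated summary of that document
    label: str      # human verdict: "faithful" or "hallucinated"
    rationale: str  # human explanation of the verdict


def build_judge_prompt(examples, source, summary):
    """Assemble a few-shot prompt asking an LLM judge whether `summary`
    is faithful to `source`, guided by human-annotated examples."""
    parts = [
        "You are a judge of summary faithfulness. Given a source document "
        "and a summary, answer 'faithful' or 'hallucinated'.\n"
    ]
    for i, ex in enumerate(examples, 1):
        parts.append(
            f"Example {i}:\nSource: {ex.source}\nSummary: {ex.summary}\n"
            f"Verdict: {ex.label} ({ex.rationale})\n"
        )
    # The judge completes the final "Verdict:" line for the new pair.
    parts.append(f"Now judge:\nSource: {source}\nSummary: {summary}\nVerdict:")
    return "\n".join(parts)


demo_annotations = [
    AnnotatedExample(
        source="The plant opened in 1998 and employs 250 people.",
        summary="The plant opened in 1998 and employs 500 people.",
        label="hallucinated",
        rationale="employee count contradicts the source",
    )
]
prompt = build_judge_prompt(
    demo_annotations,
    source="Q2 revenue rose 4 percent year over year.",
    summary="Revenue rose 4 percent in Q2.",
)
```

The resulting `prompt` string would then be sent to the judging LLM; the human annotations serve as in-context demonstrations that calibrate its notion of faithfulness.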