Rethinking Evaluation for LLM Hallucination Detection: A Desiderata, A New RAG-based Benchmark, New Insights

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing hallucination detection benchmarks lack support for long-context retrieval-augmented generation (RAG) scenarios and overlook realistic label noise, which limits how well they can evaluate detector performance. To address these gaps, this work establishes design principles for hallucination detection benchmarks and introduces TRIVIA+, an open-source benchmark that supports both very long contexts and several forms of label noise, covering sample-dependent and sample-independent mechanisms. Evaluations of state-of-the-art detectors and LLM-as-a-Judge baselines on TRIVIA+ show that current methods leave ample room for improvement in RAG settings, that label noise hinders detection performance, and that the basic LLM-as-a-Judge baseline is competitive.

📝 Abstract

Hallucination, broadly referring to unfaithful, fabricated, or inconsistent content generated by LLMs, has wide-ranging implications. Therefore, a large body of effort has been devoted to detecting LLM hallucinations, as well as to designing benchmark datasets for evaluating these detectors. In this work, we first establish a set of desiderata: properties that hallucination detection benchmarks (HDBs) should exhibit for effective evaluation. A critical look at existing HDBs through the lens of our desiderata reveals that none of them exhibits all the properties. We identify the two largest gaps: (1) RAG-based grounded benchmarks with long context are severely lacking (partly because length impedes human annotation); and (2) existing benchmarks do not make available realistic label noise for stress-testing detectors, although real-world use cases often grapple with label noise due to human or automated/weak annotation. To close these gaps, we build and open-source a new RAG-based HDB called TRIVIA+ that underwent a rigorous human annotation process. Notably, our benchmark exhibits all desirable properties, including: (1) TRIVIA+ contains samples with the longest context in the literature; and (2) we design and share four sets of noisy labels with different noise schemes, both sample-dependent and sample-independent. Finally, we perform experiments on RAG-based HDBs, including our TRIVIA+, using popular SOTA detectors that reveal new insights: (i) ample room remains for current detectors to reach the performance ceiling on RAG-based HDBs, (ii) the basic LLM-as-a-Judge baseline performs competitively, and (iii) label noise hinders detection performance. We expect that our findings, along with our proposed benchmark, will motivate and foster needed research on hallucination detection for RAG-based tasks.
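
The abstract distinguishes sample-independent from sample-dependent label noise but does not describe the four schemes themselves. As a rough illustration of the distinction only, the sketch below flips binary hallucination labels (1 = hallucinated, 0 = faithful) in two generic ways. The function names, the `difficulty` score, and the flip rates are all hypothetical and are not taken from TRIVIA+.

```python
import random


def flip_independent(labels, rate, seed=0):
    """Sample-independent noise: every label flips with the same probability.

    Hypothetical helper for illustration; not the benchmark's scheme.
    """
    rng = random.Random(seed)
    return [1 - y if rng.random() < rate else y for y in labels]


def flip_dependent(labels, difficulty, max_rate, seed=0):
    """Sample-dependent noise: flip probability grows with a per-sample score.

    `difficulty` is an assumed score in [0, 1] for each sample (for example,
    a normalized context length); the flip probability is max_rate * score.
    """
    rng = random.Random(seed)
    return [
        1 - y if rng.random() < max_rate * d else y
        for y, d in zip(labels, difficulty)
    ]


labels = [0, 1, 1, 0]
noisy_uniform = flip_independent(labels, rate=0.2)
noisy_by_difficulty = flip_dependent(labels, [0.1, 0.9, 0.5, 0.0], max_rate=0.4)
```

In the first function the chance of a wrong label is the same for every sample; in the second it depends on a property of the sample, which is closer to how annotators tend to err more on harder or longer inputs.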

Problem

Research questions and friction points this paper is trying to address.

hallucination detection

benchmark dataset

RAG-based

label noise

long context

Innovation

Methods, ideas, or system contributions that make the work stand out.

hallucination detection

RAG-based benchmark

label noise

long-context evaluation

TRIVIA+
