Detecting Hallucinations in Authentic LLM-Human Interactions

📅 2025-10-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Existing hallucination detection benchmarks rely predominantly on synthetic, human-constructed data and fail to capture the nuanced hallucination patterns that arise in real-world LLM-human interactions, especially in high-stakes domains such as medicine and law. Method: We introduce AuthenHallu, the first hallucination detection benchmark grounded in authentic LLM-human dialogues, featuring high-quality, expert-annotated conversations and comprehensive statistical analysis. Contribution/Results: AuthenHallu systematically characterizes hallucination distributions in realistic settings, revealing an overall hallucination rate of 31.4%, which rises to 60.0% on Math & Number Problems and substantially exceeds the rates observed in synthetic benchmarks. We further investigate the feasibility of using vanilla LLMs themselves as lightweight, zero-shot hallucination detectors. Our findings expose critical limitations of current detection methods under practical conditions and establish a more ecologically valid evaluation paradigm for hallucination detection research.

📝 Abstract
As large language models (LLMs) are increasingly applied in sensitive domains such as medicine and law, hallucination detection has become a critical task. Although numerous benchmarks have been proposed to advance research in this area, most of them are artificially constructed, either through deliberate hallucination induction or simulated interactions, rather than derived from genuine LLM-human dialogues. Consequently, these benchmarks fail to fully capture the characteristics of hallucinations that occur in real-world usage. To address this limitation, we introduce AuthenHallu, the first hallucination detection benchmark built entirely from authentic LLM-human interactions. For AuthenHallu, we select and annotate samples from genuine LLM-human dialogues, thereby providing a faithful reflection of how LLMs hallucinate in everyday user interactions. Statistical analysis shows that hallucinations occur in 31.4% of the query-response pairs in our benchmark, and this proportion increases dramatically to 60.0% in challenging domains such as Math & Number Problems. Furthermore, we explore the potential of using vanilla LLMs themselves as hallucination detectors and find that, despite some promise, their current performance remains insufficient in real-world scenarios.
Problem

Research questions and friction points this paper is trying to address.

Detecting hallucinations in authentic human-LLM dialogue interactions
Addressing limitations of artificially constructed hallucination benchmarks
Evaluating LLMs' capability as hallucination detectors in real-world scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constructed benchmark from authentic LLM-human dialogues
Annotated samples from genuine LLM-human interactions
Evaluated vanilla LLMs as zero-shot hallucination detectors (a minimal sketch follows below)
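To make the detector setup concrete, here is a minimal sketch of the kind of zero-shot judge the paper evaluates, assuming an OpenAI-style chat API. The model name (gpt-4o), the JUDGE_PROMPT wording, and the detect_hallucination helper are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a zero-shot hallucination detector built from a vanilla LLM.
# Hypothetical: the paper does not publish its prompt; the client, model name,
# and judge prompt below are illustrative assumptions only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are a hallucination detector.
Given a user query and an LLM response, answer with exactly one word:
"HALLUCINATED" if the response contains fabricated or unsupported claims,
otherwise "FAITHFUL".

Query: {query}
Response: {response}
Verdict:"""


def detect_hallucination(query: str, response: str, model: str = "gpt-4o") -> bool:
    """Return True if the judge model flags the response as hallucinated."""
    result = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(query=query, response=response),
        }],
        temperature=0,  # deterministic verdicts for evaluation
    )
    verdict = result.choices[0].message.content.strip().upper()
    return verdict.startswith("HALLUCINATED")
```

Setting temperature to 0 keeps verdicts deterministic, which matters when comparing detection accuracy across judge models on a fixed benchmark.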
Authors

Yujie Ren
Data Science Group, University of Hamburg, Germany

Niklas Gruhlke
Data Science Group, University of Hamburg, Germany

Anne Lauscher
Professor of Data Science at the University of Hamburg
Natural Language Processing · Ethics and AI · Computational Argumentation