🤖 AI Summary
Problem: Existing hallucination detection benchmarks rely predominantly on synthetic, human-constructed data, failing to capture the nuanced hallucination patterns that arise in real-world LLM-human interactions, especially in high-stakes domains such as medicine and law. Method: We introduce AuthenHallu, the first hallucination detection benchmark grounded in authentic LLM-human dialogues, featuring expert-annotated, high-quality conversations and comprehensive statistical analysis. Contribution/Results: AuthenHallu systematically characterizes hallucination distributions in realistic settings, revealing an overall hallucination rate of 31.4%, rising to 60.0% on Math & Number Problems, notably higher than rates observed on synthetic benchmarks. We further investigate the feasibility of using vanilla LLMs themselves as lightweight, zero-shot hallucination detectors. Our findings expose critical limitations of current detection methods under practical conditions and establish a more ecologically valid evaluation paradigm for hallucination detection research.
📝 Abstract
As large language models (LLMs) are increasingly applied in sensitive domains such as medicine and law, hallucination detection has become a critical task. Although numerous benchmarks have been proposed to advance research in this area, most of them are artificially constructed, either through deliberate hallucination induction or through simulated interactions, rather than derived from genuine LLM-human dialogues. Consequently, these benchmarks fail to fully capture the characteristics of hallucinations that occur in real-world usage. To address this limitation, we introduce AuthenHallu, the first hallucination detection benchmark built entirely from authentic LLM-human interactions. For AuthenHallu, we select and annotate samples from genuine LLM-human dialogues, thereby providing a faithful reflection of how LLMs hallucinate in everyday user interactions. Statistical analysis shows that hallucinations occur in 31.4% of the query-response pairs in our benchmark, and this proportion rises dramatically to 60.0% in challenging domains such as Math & Number Problems. Furthermore, we explore the potential of using vanilla LLMs themselves as hallucination detectors and find that, despite some promise, their current performance remains insufficient for real-world scenarios.