🤖 AI Summary
Problem: Existing hallucination detection benchmarks rely predominantly on synthetic, human-constructed data, failing to capture the nuanced hallucination patterns that arise in real-world LLM-human interactions, especially in high-stakes domains such as medicine and law. Method: We introduce AuthenHallu, the first hallucination detection benchmark grounded in authentic LLM-human dialogues, featuring expert-annotated, high-quality conversations and comprehensive statistical analysis. Contribution/Results: AuthenHallu systematically characterizes hallucination distributions in realistic settings, revealing an overall hallucination rate of 31.4%, rising to 60.0% on Math & Number Problems, notably higher than rates observed on synthetic benchmarks. We further investigate the feasibility of using vanilla LLMs themselves as lightweight, zero-shot hallucination detectors. Our findings expose critical limitations of current detection methods under practical conditions and establish a more ecologically valid evaluation paradigm for hallucination detection research.
📝 Abstract
As large language models (LLMs) are increasingly applied in sensitive domains such as medicine and law, hallucination detection has become a critical task. Although numerous benchmarks have been proposed to advance research in this area, most of them are artificially constructed, either through deliberate hallucination induction or through simulated interactions, rather than derived from genuine LLM-human dialogues. Consequently, these benchmarks fail to fully capture the characteristics of hallucinations that occur in real-world usage. To address this limitation, we introduce AuthenHallu, the first hallucination detection benchmark built entirely from authentic LLM-human interactions. For AuthenHallu, we select and annotate samples from genuine LLM-human dialogues, thereby providing a faithful reflection of how LLMs hallucinate in everyday user interactions. Statistical analysis shows that hallucinations occur in 31.4% of the query-response pairs in our benchmark, and this proportion rises dramatically to 60.0% in challenging domains such as Math & Number Problems. Furthermore, we explore the potential of using vanilla LLMs themselves as hallucination detectors and find that, despite some promise, their current performance remains insufficient for real-world scenarios.