HALoGEN: Fantastic LLM Hallucinations and Where to Find Them

πŸ“… 2025-01-14
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Large language models (LLMs) produce pervasive factual hallucinations, yet verifying model generations by hand is expensive and slow. To address this, the paper introduces HALoGEN, a comprehensive automated hallucination benchmark comprising 10,923 prompts across nine domains (including programming, scientific attribution, and summarization), paired with high-precision automatic verifiers that decompose LLM generations into atomic units and check each unit against a high-quality knowledge source. It also defines a three-way error taxonomy: Type A (incorrect recollection of training data), Type B (incorrect knowledge in the training data itself), and Type C (fabrication). Evaluating ~150,000 generations from 14 language models, the authors find that even the best-performing models are riddled with hallucinations, with up to 86% of generated atomic facts hallucinated in some domains. HALoGEN offers a reproducible, scalable, and quantitative foundation for studying why generative models hallucinate and for developing trustworthy LLMs.

πŸ“ Abstract
Despite their impressive ability to generate high-quality and fluent text, generative large language models (LLMs) also produce hallucinations: statements that are misaligned with established world knowledge or provided input context. However, measuring hallucination can be challenging, as having humans verify model generations on-the-fly is both expensive and time-consuming. In this work, we release HALoGEN, a comprehensive hallucination benchmark consisting of: (1) 10,923 prompts for generative models spanning nine domains including programming, scientific attribution, and summarization, and (2) automatic high-precision verifiers for each use case that decompose LLM generations into atomic units, and verify each unit against a high-quality knowledge source. We use this framework to evaluate ~150,000 generations from 14 language models, finding that even the best-performing models are riddled with hallucinations (sometimes up to 86% of generated atomic facts depending on the domain). We further define a novel error classification for LLM hallucinations based on whether they likely stem from incorrect recollection of training data (Type A errors), or incorrect knowledge in training data (Type B errors), or are fabrication (Type C errors). We hope our framework provides a foundation to enable the principled study of why generative models hallucinate, and advances the development of trustworthy large language models.
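The decompose-and-verify pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual verifiers: the sentence-level fact splitter and the set-based knowledge source are toy stand-ins, and all function names here are hypothetical.

```python
# Toy sketch of HALoGEN-style evaluation: decompose a generation into
# atomic units, verify each against a knowledge source, and report the
# fraction of unsupported (hallucinated) units.

def decompose(generation: str) -> list[str]:
    """Split a generation into atomic factual units (toy: one per sentence)."""
    return [s.strip() for s in generation.split(".") if s.strip()]

def verify(fact: str, knowledge: set[str]) -> bool:
    """Check one atomic unit against a high-quality knowledge source."""
    return fact in knowledge

def hallucination_rate(generation: str, knowledge: set[str]) -> float:
    """Fraction of atomic units NOT supported by the knowledge source."""
    facts = decompose(generation)
    if not facts:
        return 0.0
    unsupported = sum(1 for f in facts if not verify(f, knowledge))
    return unsupported / len(facts)

knowledge = {"Paris is the capital of France"}
gen = "Paris is the capital of France. Paris has 90 million residents"
print(hallucination_rate(gen, knowledge))  # 0.5: one of two units unsupported
```

In the paper's setting, `decompose` and `verify` are domain-specific and high-precision, so the resulting per-domain hallucination rates are comparable across the 14 evaluated models.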
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Factuality Errors
Detection Methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

HALoGEN
LLM Error Detection
Automated Verification
πŸ”Ž Similar Papers
No similar papers found.