HalluLens: LLM Hallucination Benchmark

📅 2025-04-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
LLMs' "hallucinations," outputs that are inconsistent with either the user's input or the model's training data, severely undermine their trustworthiness and practical applicability. Existing work suffers from ambiguous definitions, inconsistent taxonomies, and saturated or leakage-prone evaluation benchmarks. To address these limitations, we propose a comprehensive hallucination benchmark for LLMs. Our method introduces three core contributions: (1) a clear decoupling of "hallucination" from "factuality," establishing a taxonomy that distinguishes *extrinsic* hallucinations (content not supported by the training data) from *intrinsic* hallucinations (content inconsistent with the user-provided input); (2) a dynamic test-set generation mechanism to prevent data leakage and benchmark saturation; and (3) a systematic analysis of existing benchmarks, separating hallucination evaluation from factuality evaluation and highlighting their limitations. The open-sourced benchmark uncovers persistent failure modes, including factual confusion, and provides a standardized assessment platform for advancing LLM reliability research.

📝 Abstract
Large language models (LLMs) often generate responses that deviate from user input or training data, a phenomenon known as "hallucination." These hallucinations undermine user trust and hinder the adoption of generative AI systems. Addressing hallucinations is essential for the advancement of LLMs. This paper introduces a comprehensive hallucination benchmark, incorporating both new extrinsic and existing intrinsic evaluation tasks, built upon a clear taxonomy of hallucination. A major challenge in benchmarking hallucinations is the lack of a unified framework due to inconsistent definitions and categorizations. We disentangle LLM hallucination from "factuality," proposing a clear taxonomy that distinguishes between extrinsic and intrinsic hallucinations, to promote consistency and facilitate research. Extrinsic hallucinations, where the generated content is not consistent with the training data, are increasingly important as LLMs evolve. Our benchmark includes dynamic test set generation to mitigate data leakage and ensure robustness against it. We also analyze existing benchmarks, highlighting their limitations and saturation. The work aims to: (1) establish a clear taxonomy of hallucinations, (2) introduce new extrinsic hallucination tasks, with data that can be dynamically regenerated to prevent saturation by leakage, and (3) provide a comprehensive analysis of existing benchmarks, distinguishing them from factuality evaluations.
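
The dynamic test-set idea in the abstract can be pictured with a minimal sketch (hypothetical field names and helper, not the authors' released code): instead of shipping a fixed test file, evaluation items are regenerated from freshly sampled source material on each run, so the concrete prompts cannot accumulate in future training corpora.

```python
import random

def regenerate_test_set(corpus, n_items, seed=None):
    """Hypothetical sketch of dynamic test-set generation: draw a fresh
    sample of source items on every evaluation run and turn them into
    prompts, so no static test file exists that could leak into training."""
    rng = random.Random(seed)
    sampled = rng.sample(corpus, n_items)
    return [
        {
            "prompt": f"Answer from your own knowledge: {item['question']}",
            "reference": item["answer"],
            "source": item,
        }
        for item in sampled
    ]

# Each run uses a new seed, so the concrete test items change between runs
# while the task distribution stays comparable.
items = regenerate_test_set(
    corpus=[{"question": "Who wrote 'Hamlet'?", "answer": "William Shakespeare"}],
    n_items=1,
)
```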
Problem

Research questions and friction points this paper is trying to address.

Define clear taxonomy for LLM hallucination types
Develop dynamic benchmark to prevent data leakage
Distinguish hallucination from factuality in LLM outputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces clear hallucination taxonomy framework
Dynamic test set prevents data leakage
Distinguishes extrinsic from intrinsic hallucinations (see sketch below)
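
As a rough illustration of the extrinsic/intrinsic split (assumed structure, not the paper's evaluation code): a claim inconsistent with the user-provided input is intrinsic, while a claim unsupported by the model's training data is extrinsic.

```python
from enum import Enum

class Hallucination(Enum):
    NONE = "none"
    INTRINSIC = "intrinsic"   # inconsistent with the user-provided input
    EXTRINSIC = "extrinsic"   # not supported by the training data

def classify(consistent_with_input: bool, supported_by_training: bool) -> Hallucination:
    """Toy classifier over pre-computed consistency judgments; in practice
    both booleans would come from an external verifier or human annotation."""
    if not consistent_with_input:
        return Hallucination.INTRINSIC
    if not supported_by_training:
        return Hallucination.EXTRINSIC
    return Hallucination.NONE
```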