KGHaluBench: A Knowledge Graph-Based Hallucination Benchmark for Evaluating the Breadth and Depth of LLM Knowledge

📅 2026-02-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks for evaluating hallucinations in large language models (LLMs) suffer from static question sets and narrow coverage, limiting their ability to comprehensively assess the truthfulness and depth of model knowledge. This work proposes KGHaluBench, the first hallucination benchmark that dynamically generates challenging questions from knowledge graphs to probe both the breadth and depth of model knowledge. It incorporates statistical difficulty modeling to mitigate popularity bias and employs multi-level automated validation—encompassing conceptual consistency and factual correctness—alongside an interpretable hallucination taxonomy and a novel accuracy metric to enable fine-grained analysis of both hallucinations and refusal behaviors. Systematic evaluation of 25 state-of-the-art LLMs reveals the nuanced impacts of model scale and knowledge factors on hallucination tendencies. The benchmark is publicly released to support future research.

📝 Abstract
Large Language Models (LLMs) possess a remarkable capacity to generate persuasive and intelligible language. However, coherence does not equate to truthfulness, as responses often contain subtle hallucinations. Existing benchmarks rely on static, narrow question sets, leading to poor coverage and misleading evaluations. We present KGHaluBench, a Knowledge Graph-based hallucination benchmark that assesses LLMs across the breadth and depth of their knowledge, providing a fairer and more comprehensive insight into LLM truthfulness. Our framework utilises the KG to dynamically construct challenging, multifaceted questions, whose difficulty is then statistically estimated to address popularity bias. Our automated verification pipeline detects abstentions and verifies the LLM's response at both the conceptual and correctness levels to identify different types of hallucinations. We evaluate 25 frontier models using novel accuracy and hallucination metrics. The results provide a more interpretable insight into the knowledge factors that cause hallucinations across different model sizes. KGHaluBench is publicly available to support future developments in hallucination mitigation.
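The abstract's pipeline—generating questions from knowledge-graph triples and weighting them by a popularity-aware difficulty estimate—can be illustrated with a toy sketch. The triples, question templates, and frequency-based difficulty proxy below are assumptions for illustration only, not the paper's actual implementation.

```python
# Illustrative sketch, NOT the paper's method: turn (subject, relation,
# object) triples into factual questions, using entity frequency in the
# graph as a crude stand-in for a statistical difficulty estimate.
from collections import Counter

# Hypothetical mini knowledge graph.
triples = [
    ("Marie Curie", "award_received", "Nobel Prize in Physics"),
    ("Marie Curie", "field_of_work", "radioactivity"),
    ("Lise Meitner", "field_of_work", "nuclear physics"),
]

# Popularity proxy: how often an entity appears in the graph.
entity_freq = Counter()
for s, _, o in triples:
    entity_freq[s] += 1
    entity_freq[o] += 1

def generate_questions(triples):
    """Instantiate a template per relation; rarer subjects score as harder."""
    templates = {
        "award_received": "Which award did {s} receive?",
        "field_of_work": "What was {s}'s field of work?",
    }
    questions = []
    for s, r, o in triples:
        questions.append({
            "question": templates[r].format(s=s),
            "answer": o,
            # Lower subject frequency ~ less popular ~ presumed harder.
            "difficulty": 1.0 / entity_freq[s],
        })
    return questions

# Rank questions hardest-first, as a difficulty-aware benchmark might.
for q in sorted(generate_questions(triples), key=lambda x: -x["difficulty"]):
    print(q["question"], "->", q["answer"])
```

A real pipeline would draw on a large KG (e.g. Wikidata), compose multi-hop questions, and calibrate difficulty statistically rather than by raw entity counts; this sketch only conveys the overall shape.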
Problem

Research questions and friction points this paper is trying to address.

hallucination
large language models
knowledge graph
benchmark
truthfulness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Knowledge Graph
Hallucination Benchmark
Dynamic Question Generation
Automated Verification
Popularity Bias Mitigation
Alex Robertson
School of Computing, Newcastle University
Huizhi Liang
Newcastle University
Data Mining · Machine Learning · Personalization · Recommender Systems
Mahbub Gani
Sage Ai, Sage Group PLC
Rohit Kumar
Sage Ai, Sage Group PLC
Srijith Rajamohan
Sage Ai, Sage Group PLC