KGHaluBench: A Knowledge Graph-Based Hallucination Benchmark for Evaluating the Breadth and Depth of LLM Knowledge

📅 2026-02-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks for evaluating hallucinations in large language models (LLMs) suffer from static question sets and narrow coverage, limiting their ability to comprehensively assess the truthfulness and depth of model knowledge. This work proposes KGHaluBench, the first hallucination benchmark that dynamically generates challenging questions from knowledge graphs to probe both the breadth and depth of model knowledge. It incorporates statistical difficulty modeling to mitigate popularity bias and employs multi-level automated validation—encompassing conceptual consistency and factual correctness—alongside an interpretable hallucination taxonomy and a novel accuracy metric to enable fine-grained analysis of both hallucinations and refusal behaviors. Systematic evaluation of 25 state-of-the-art LLMs reveals the nuanced impacts of model scale and knowledge factors on hallucination tendencies. The benchmark is publicly released to support future research.

📝 Abstract
Large Language Models (LLMs) possess a remarkable capacity to generate persuasive and intelligible language. However, coherence does not equate to truthfulness, as responses often contain subtle hallucinations. Existing benchmarks rely on static, narrow question sets, leading to poor coverage and misleading evaluations. We present KGHaluBench, a Knowledge Graph-based hallucination benchmark that assesses LLMs across the breadth and depth of their knowledge, providing a fairer and more comprehensive insight into LLM truthfulness. Our framework utilises the KG to dynamically construct challenging, multifaceted questions, whose difficulty is then statistically estimated to address popularity bias. Our automated verification pipeline detects abstentions and verifies the LLM's response at both the conceptual and correctness levels to identify different types of hallucinations. We evaluate 25 frontier models using novel accuracy and hallucination metrics. The results provide a more interpretable insight into the knowledge factors that cause hallucinations across different model sizes. KGHaluBench is publicly available to support future developments in hallucination mitigation.
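The abstract's pipeline—generating questions from knowledge-graph triples and weighting them by a popularity-aware difficulty estimate—can be illustrated with a toy sketch. The triples, question templates, and frequency-based difficulty proxy below are assumptions for illustration only, not the paper's actual implementation.

```python
# Illustrative sketch, NOT the paper's method: turn (subject, relation,
# object) triples into factual questions, using entity frequency in the
# graph as a crude stand-in for a statistical difficulty estimate.
from collections import Counter

# Hypothetical mini knowledge graph.
triples = [
    ("Marie Curie", "award_received", "Nobel Prize in Physics"),
    ("Marie Curie", "field_of_work", "radioactivity"),
    ("Lise Meitner", "field_of_work", "nuclear physics"),
]

# Popularity proxy: how often an entity appears in the graph.
entity_freq = Counter()
for s, _, o in triples:
    entity_freq[s] += 1
    entity_freq[o] += 1

def generate_questions(triples):
    """Instantiate a template per relation; rarer subjects score as harder."""
    templates = {
        "award_received": "Which award did {s} receive?",
        "field_of_work": "What was {s}'s field of work?",
    }
    questions = []
    for s, r, o in triples:
        questions.append({
            "question": templates[r].format(s=s),
            "answer": o,
            # Lower subject frequency ~ less popular ~ presumed harder.
            "difficulty": 1.0 / entity_freq[s],
        })
    return questions

# Rank questions hardest-first, as a difficulty-aware benchmark might.
for q in sorted(generate_questions(triples), key=lambda x: -x["difficulty"]):
    print(q["question"], "->", q["answer"])
```

A real pipeline would draw on a large KG (e.g. Wikidata), compose multi-hop questions, and calibrate difficulty statistically rather than by raw entity counts; this sketch only conveys the overall shape.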
Problem

Research questions and friction points this paper is trying to address.

hallucination
large language models
knowledge graph
benchmark
truthfulness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Knowledge Graph
Hallucination Benchmark
Dynamic Question Generation
Automated Verification
Popularity Bias Mitigation
Alex Robertson
School of Computing, Newcastle University
Huizhi Liang
Newcastle University
Data Mining · Machine Learning · Personalization · Recommender Systems
Mahbub Gani
Sage Ai, Sage Group PLC
Rohit Kumar
Sage Ai, Sage Group PLC
Srijith Rajamohan
Sage Ai, Sage Group PLC