MultiHal: Multilingual Dataset for Knowledge-Graph Grounded Evaluation of LLM Hallucinations

📅 2025-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Existing hallucination evaluation benchmarks are predominantly English-centric and lack structured knowledge support, which limits assessment of factual consistency in multilingual large language models (LLMs). Method: The authors propose MultiHal, a knowledge graph (KG)-based, multilingual, multi-hop benchmark framed for generative text evaluation. It contains 25.9k high-quality KG paths curated from 140k raw paths mined from open-domain KGs, and combines structured KG paths with multi-hop reasoning, cross-lingual entity alignment, and KG-augmented retrieval-augmented generation (KG-RAG) to build a multilingual generative question-answering evaluation framework. Contribution/Results: Baseline experiments show that KG-RAG raises the semantic similarity score by approximately 0.12 to 0.36 points over vanilla QA across multiple languages and LLMs, indicating that KG integration can help mitigate hallucinations. MultiHal addresses a gap in non-English, structured, multi-hop factual consistency evaluation.

📝 Abstract

Large Language Models (LLMs) have inherent limitations in faithfulness and factuality, commonly referred to as hallucinations. Several benchmarks have been developed that provide a test bed for factuality evaluation within the context of English-centric datasets; they rely on supplementary informative context such as web links or text passages but ignore the available structured factual resources. To this end, Knowledge Graphs (KGs) have been identified as a useful aid for hallucination mitigation, as they provide a structured way to represent facts about entities and their relations with minimal linguistic overhead. We address the lack of KG paths and multilinguality for factual language modeling in existing hallucination evaluation benchmarks and propose a KG-based multilingual, multihop benchmark called MultiHal, framed for generative text evaluation. As part of our data collection pipeline, we mined 140k KG paths from open-domain KGs and pruned the noisy ones, curating a high-quality subset of 25.9k. Our baseline evaluation shows an absolute increase of approximately 0.12 to 0.36 points in semantic similarity score for KG-RAG over vanilla QA across multiple languages and multiple models, demonstrating the potential of KG integration. We anticipate MultiHal will foster future research on several graph-based hallucination mitigation and fact-checking tasks.
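
The reported gain is a difference between two semantic similarity scores: one for the KG-RAG answer and one for the vanilla QA answer, each measured against a reference answer. The sketch below illustrates that comparison only. The abstract does not name the metric or encoder, so the bag-of-words embedding here is a toy stand-in for a real sentence encoder, and the function names `bow_embed`, `cosine_similarity`, and `similarity_gain` are hypothetical.

```python
import math
from collections import Counter


def bow_embed(text):
    """Toy bag-of-words vector (assumption: stands in for a sentence encoder)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_gain(gold, vanilla_answer, kg_rag_answer):
    """Absolute change in similarity to the gold answer: KG-RAG minus vanilla QA."""
    gold_vec = bow_embed(gold)
    vanilla_score = cosine_similarity(gold_vec, bow_embed(vanilla_answer))
    kg_rag_score = cosine_similarity(gold_vec, bow_embed(kg_rag_answer))
    return kg_rag_score - vanilla_score
```

A positive return value means the KG-RAG answer is closer to the reference than the vanilla answer. Averaging this quantity per language and per model would give figures comparable in kind, though not in value, to the 0.12 to 0.36 range quoted above.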

Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM hallucinations using multilingual KG-based benchmarks

Addressing lack of structured factual resources in current evaluations

Improving factuality via KG integration across multiple languages

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Knowledge Graphs for hallucination mitigation

Multilingual KG-based benchmark for LLM evaluation

Integrates KG-RAG for improved semantic similarity
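
To make the KG-RAG versus vanilla QA contrast concrete, here is a minimal sketch of how a multi-hop KG path could be turned into prompt context. It assumes a path is a list of (subject, predicate, object) triples; the prompt wording, the arrow-based linearization, and the helper names are illustrative assumptions, not the format used by MultiHal.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def is_connected(path: List[Triple]) -> bool:
    """True if each hop starts at the entity where the previous hop ended."""
    return all(path[i][2] == path[i + 1][0] for i in range(len(path) - 1))


def linearize_path(path: List[Triple]) -> str:
    """Render a KG path as plain text with minimal linguistic overhead."""
    return " ; ".join(f"{s} -> {p} -> {o}" for s, p, o in path)


def build_vanilla_prompt(question: str) -> str:
    """Baseline prompt: the question alone, with no supporting facts."""
    return f"Question: {question}\nAnswer:"


def build_kg_rag_prompt(question: str, path: List[Triple]) -> str:
    """KG-RAG prompt: the same question, preceded by the linearized KG path."""
    return (
        "Use the following knowledge-graph facts to answer.\n"
        f"Facts: {linearize_path(path)}\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

For example, the two-hop path `[("Marie Curie", "place of birth", "Warsaw"), ("Warsaw", "country", "Poland")]` supplies the facts needed to answer "In which country was Marie Curie born?", whereas the vanilla prompt leaves the model to rely on its parametric knowledge.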

🔎 Similar Papers

Reefknot: A Comprehensive Benchmark for Relation Hallucination Evaluation, Analysis and Mitigation in Multimodal Large Language Models