🤖 AI Summary
Existing hallucination evaluation datasets lack comprehensive coverage of fine-grained hallucinations (at the entity, relation, and sentence levels) in multilingual settings. Method: This paper introduces HalluVerse25, the first fine-grained multilingual hallucination benchmark, covering English, Arabic, and Turkish. Its construction pipeline combines LLM-controlled hallucination injection with collaborative annotation by trilingual domain experts and a cross-lingual consistency verification step. Contribution/Results: HalluVerse25 provides the first fine-grained, multilingual hallucination-type annotations, with high sample authenticity and inter-annotator agreement. Evaluation on HalluVerse25 shows that proprietary models consistently outperform open-source counterparts at hallucination detection, while all models perform worst on Arabic, exposing a key bottleneck in current multilingual hallucination identification.
📝 Abstract
Large Language Models (LLMs) are increasingly used in various contexts, yet remain prone to generating non-factual content, commonly referred to as "hallucinations". The literature categorizes hallucinations into several types, including entity-level, relation-level, and sentence-level hallucinations. However, existing hallucination datasets often fail to capture fine-grained hallucinations in multilingual settings. In this work, we introduce HalluVerse25, a multilingual LLM hallucination dataset that categorizes fine-grained hallucinations in English, Arabic, and Turkish. Our dataset construction pipeline uses an LLM to inject hallucinations into factual biographical sentences, followed by a rigorous human annotation process to ensure data quality. We evaluate several LLMs on HalluVerse25, providing valuable insights into how proprietary models perform in detecting LLM-generated hallucinations across different contexts.
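The injection step described above can be sketched in code. This is a minimal, hypothetical illustration, not the paper's actual pipeline: in HalluVerse25 an LLM performs the perturbation, while here a plain string substitution stands in for the model call so the example is self-contained. All names (`HallucinationSample`, `inject_entity_hallucination`) are invented for this sketch.

```python
# Sketch of fine-grained hallucination injection: take a factual biographical
# sentence, corrupt one element, and record the hallucination type so human
# annotators can later verify the label. A real pipeline would prompt an LLM
# instead of doing a string replace (see the docstring below).

from dataclasses import dataclass

HALLUCINATION_TYPES = ("entity", "relation", "sentence")

@dataclass
class HallucinationSample:
    language: str            # e.g. "en", "ar", "tr"
    factual: str             # original, verified sentence
    hallucinated: str        # sentence with the injected error
    hallucination_type: str  # one of HALLUCINATION_TYPES

def inject_entity_hallucination(sentence: str, correct: str, wrong: str,
                                language: str = "en") -> HallucinationSample:
    """Swap one named entity to create an entity-level hallucination.

    An LLM-based version would instead use a prompt such as:
    "Rewrite the sentence, changing exactly one named entity so the
    result stays fluent but becomes factually wrong."
    """
    if correct not in sentence:
        raise ValueError(f"entity {correct!r} not found in sentence")
    return HallucinationSample(
        language=language,
        factual=sentence,
        hallucinated=sentence.replace(correct, wrong, 1),
        hallucination_type="entity",
    )

sample = inject_entity_hallucination(
    "Marie Curie was born in Warsaw in 1867.", "Warsaw", "Paris")
print(sample.hallucinated)  # Marie Curie was born in Paris in 1867.
```

Keeping the factual and hallucinated sentences paired, with an explicit type label, is what makes the downstream task fine-grained classification rather than binary hallucination detection.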