Halluverse-M^3: A multitask multilingual benchmark for hallucination in LLMs

📅 2026-02-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the lack of fine-grained, cross-lingual benchmarks for evaluating hallucinations in large language models, an area that has so far been predominantly limited to English. The authors present the first multilingual hallucination evaluation benchmark spanning English, Arabic, Hindi, and Turkish, covering both question answering and dialogue summarization. The benchmark explicitly distinguishes hallucinations at the entity, relation, and sentence levels. Hallucinated samples are generated through controlled text editing and validated by human annotators to ensure data quality, enabling fine-grained assessment of both open-source and proprietary models. Experimental results show that hallucinations in question answering are easier to detect than those in dialogue summarization, with sentence-level hallucinations posing the greatest challenge. Models perform best in English, while detection performance drops significantly for lower-resource languages such as Hindi.
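
To make the three hallucination granularities named above concrete, here is a minimal sketch in Python. The sentences, labels, and the `HallucinationExample` structure are illustrative assumptions for this write-up; they are not drawn from the HalluVerse-M3 data itself.

```python
# Hypothetical illustration of entity-, relation-, and sentence-level
# hallucinations produced by controlled edits of a faithful output.
# All example text here is invented, not taken from the dataset.
from dataclasses import dataclass

@dataclass
class HallucinationExample:
    source: str          # grounding text (question context or dialogue)
    faithful: str        # original, faithful output
    hallucinated: str    # output after a controlled edit
    level: str           # "entity", "relation", or "sentence"

examples = [
    # Entity-level: a single named entity is swapped for an incorrect one.
    HallucinationExample(
        source="Marie Curie won the Nobel Prize in Physics in 1903.",
        faithful="Marie Curie received the 1903 Nobel Prize in Physics.",
        hallucinated="Marie Curie received the 1903 Nobel Prize in Chemistry.",
        level="entity",
    ),
    # Relation-level: entities are kept, but the relation between them changes.
    HallucinationExample(
        source="Marie Curie won the Nobel Prize in Physics in 1903.",
        faithful="Marie Curie received the 1903 Nobel Prize in Physics.",
        hallucinated="Marie Curie declined the 1903 Nobel Prize in Physics.",
        level="relation",
    ),
    # Sentence-level: a whole unsupported statement is appended.
    HallucinationExample(
        source="Marie Curie won the Nobel Prize in Physics in 1903.",
        faithful="Marie Curie received the 1903 Nobel Prize in Physics.",
        hallucinated="Marie Curie received the 1903 Nobel Prize in Physics "
                     "and later served as prime minister of Poland.",
        level="sentence",
    ),
]

for ex in examples:
    print(f"[{ex.level}] {ex.hallucinated}")
```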

📝 Abstract
Hallucinations in large language models remain a persistent challenge, particularly in multilingual and generative settings where factual consistency is difficult to maintain. While recent models show strong performance on English-centric benchmarks, their behavior across languages, tasks, and hallucination types is not yet well understood. In this work, we introduce Halluverse-M^3, a dataset designed to enable systematic analysis of hallucinations across multiple languages, multiple generation tasks, and multiple hallucination categories. Halluverse-M^3 covers four languages: English, Arabic, Hindi, and Turkish, and supports two generation tasks: question answering and dialogue summarization. The dataset explicitly distinguishes between entity-level, relation-level, and sentence-level hallucinations. Hallucinated outputs are constructed through a controlled editing process and validated by human annotators, ensuring clear alignment between original content and hallucinated generations. Using this dataset, we evaluate a diverse set of contemporary open-source and proprietary language models on fine-grained hallucination detection. Our results show that question answering is consistently easier than dialogue summarization, while sentence-level hallucinations remain challenging even for the strongest models. Performance is highest in English and degrades in lower-resource languages, with Hindi exhibiting the lowest detection accuracy. Overall, Halluverse-M^3 provides a realistic and challenging benchmark for studying hallucinations in multilingual, multi-task settings. We release the dataset to support future research on hallucination detection and mitigation: https://huggingface.co/datasets/sabdalja/HalluVerse-M3
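
As a rough sketch of how the released dataset might be used for fine-grained hallucination detection: the Hugging Face repo id below comes from the abstract, but the split name and column names ("language", "level", "source", "output", "label") are assumptions rather than documented fields, so consult the dataset card before running anything like this.

```python
# Sketch: load HalluVerse-M3 and score a user-supplied judge model on
# hallucination detection, grouped by language and hallucination level.
# Split and column names are assumed, not confirmed by the paper.
from collections import defaultdict
from datasets import load_dataset

ds = load_dataset("sabdalja/HalluVerse-M3", split="train")  # split name assumed

def build_detection_prompt(source_text: str, output_text: str) -> str:
    """Frame hallucination detection as a yes/no judgment against the source."""
    return (
        "You are checking a model output against its source.\n"
        f"Source:\n{source_text}\n\n"
        f"Output:\n{output_text}\n\n"
        "Does the output contain information not supported by the source? "
        "Answer 'yes' or 'no'."
    )

def evaluate(judge_llm, rows):
    """judge_llm is any callable prompt -> str supplied by the user."""
    correct, total = defaultdict(int), defaultdict(int)
    for row in rows:
        pred = judge_llm(build_detection_prompt(row["source"], row["output"]))
        is_hallucinated = pred.strip().lower().startswith("yes")
        key = (row["language"], row["level"])
        correct[key] += int(is_hallucinated == bool(row["label"]))
        total[key] += 1
    return {k: correct[k] / total[k] for k in total}
```

Grouping accuracy by (language, level) mirrors the paper's headline findings: per-language degradation (lowest in Hindi) and the difficulty of sentence-level hallucinations.
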
Problem

Research questions and friction points this paper is trying to address.

hallucination
multilingual
large language models
factual consistency
generation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

multilingual hallucination
multi-task benchmark
controlled hallucination generation
fine-grained hallucination detection
large language models
🔎 Similar Papers