AI Summary
This work addresses the absence of high-quality hallucination evaluation benchmarks for Arabic large language models that adequately reflect the language's linguistic, cultural, and reasoning characteristics. To this end, the authors present HalluScore, the first structured Arabic question-answering benchmark designed to systematically assess hallucinations across varying levels of reasoning difficulty, knowledge domains, historical periods, and cultural contexts. HalluScore employs a multi-label annotation scheme with human validation to enable fine-grained hallucination categorization. The dataset is constructed through a rigorous pipeline integrating fact-checking, model-driven filtering, and manual verification to ensure questions effectively elicit and identify hallucinatory responses. The final release comprises 827 questions, and evaluations across 17 Arabic and multilingual large language models reveal persistent challenges in cultural understanding and logical consistency.
Abstract
Large language models (LLMs) have achieved remarkable progress in natural language generation, but remain susceptible to hallucination. In response to growing concerns about hallucinations, several benchmarks have been developed, primarily in English and Chinese. However, Arabic remains underrepresented, with few benchmarks for LLM hallucination due to scarce annotated resources and the language's morphological complexity. Consequently, existing benchmarks do not adequately reflect the linguistic, cultural, and reasoning characteristics of Arabic. To address this gap, we introduce HalluScore, a structured Arabic question-answering benchmark designed to evaluate hallucination behavior in LLMs across different levels of reasoning difficulty, various knowledge domains, historical timelines, and culturally grounded Arabic scenarios. It contains 827 carefully curated questions for evaluating, detecting, and mitigating hallucination in LLMs. The dataset was constructed through a structured pipeline involving quality assurance, filtering for clarity and factual validity, and model-driven selection to retain questions that consistently trigger hallucinations. Each question is linked to verified ground-truth evidence, answer explanations, and multi-label annotations. Using the HalluScore benchmark, we conduct a comprehensive empirical analysis of hallucination patterns across 17 Arabic, multilingual, and reasoning LLMs. Moreover, we provide high-quality human annotations that label the responses of all evaluated LLMs as hallucinated, non-hallucinated, or partially hallucinated. Our results suggest that hallucination in Arabic LLMs extends beyond factual inaccuracies, encompassing challenges related to cultural understanding, linguistic reasoning, and logical consistency. We release HalluScore to support future research on improving the reliability and cultural competence of LLMs in Arabic.