Cultural Benchmarking of LLMs in Standard and Dialectal Arabic Dialogues

📅 2026-04-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses a gap in how large language models (LLMs) are evaluated in Arabic: existing benchmarks mostly use short texts in Modern Standard Arabic (MSA) and overlook the cultural nuances and dialectal diversity of everyday conversation. To bridge this gap, the authors introduce ArabCulture-Dialogue, a dialogue dataset spanning 13 Arabic-speaking countries, written in both MSA and each country's dialect, and covering 12 everyday-life topics and 54 subtopics. They build three evaluation tasks on it: multiple-choice cultural reasoning, translation between MSA and dialects, and dialect-steered text generation. The work offers a systematic assessment of LLMs' cultural comprehension in multi-dialectal conversational settings. Experiments show that the evaluated models perform worse on all three tasks in the dialectal setting than in MSA, which points to persistent weaknesses in cultural and dialectal understanding.

📝 Abstract

There is a significant gap in evaluating cultural reasoning in LLMs using conversational datasets that capture culturally rich and dialectal contexts. Most Arabic benchmarks focus on short text snippets in Modern Standard Arabic (MSA), overlooking the cultural nuances that naturally arise in dialogues. To address this gap, we introduce ArabCulture-Dialogue, a culturally grounded conversational dataset covering 13 Arabic-speaking countries, in both MSA and each country's respective dialect, spanning 12 daily-life topics and 54 fine-grained subtopics. We use the dataset to form three benchmarking tasks: (i) multiple-choice cultural reasoning, (ii) machine translation between MSA and dialects, and (iii) dialect-steering generation. Our experiments indicate that the performance gap between MSA and Arabic dialects persists: the models perform worse on all three tasks in the dialectal setup than in the MSA one.
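As a rough illustration of how the MSA-versus-dialect gap on the multiple-choice task could be measured, the sketch below computes accuracy per language variety and takes the difference. The record layout (`variety`, `gold`, `pred`), the function name `accuracy_by_variety`, and the toy rows are all assumptions made for this example; they are not the dataset's actual schema, the authors' evaluation code, or results from the paper.

```python
from collections import defaultdict

# Toy, made-up records (NOT data from the paper). Each row holds the
# language variety of the dialogue, the gold option, and a model's choice.
records = [
    {"variety": "MSA", "gold": "A", "pred": "A"},
    {"variety": "MSA", "gold": "C", "pred": "C"},
    {"variety": "MSA", "gold": "B", "pred": "D"},
    {"variety": "dialect", "gold": "A", "pred": "A"},
    {"variety": "dialect", "gold": "C", "pred": "B"},
    {"variety": "dialect", "gold": "B", "pred": "D"},
]


def accuracy_by_variety(rows):
    """Return {variety: accuracy} for multiple-choice predictions."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        total[row["variety"]] += 1
        correct[row["variety"]] += int(row["gold"] == row["pred"])
    return {variety: correct[variety] / total[variety] for variety in total}


scores = accuracy_by_variety(records)
# A positive gap means the model did better on MSA than on the dialect.
gap = scores["MSA"] - scores["dialect"]
print(scores, gap)
```

The same per-variety grouping would apply to the other two tasks if accuracy were swapped for a translation or generation metric; which metrics the authors use is not stated in this summary.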

Problem

Research questions and friction points this paper is trying to address.

cultural reasoning

Arabic dialects

conversational dataset

Modern Standard Arabic

LLM benchmarking

Innovation

Methods, ideas, or system contributions that make the work stand out.

cultural reasoning

dialectal Arabic

conversational dataset