AI Summary
This study addresses the limited scope of existing evaluations of large language models (LLMs) in Arabic, which predominantly focus on short texts in Modern Standard Arabic (MSA) and neglect the cultural nuances and dialectal diversity of everyday conversation. To bridge this gap, the authors introduce ArabCulture-Dialogue, a dialogue dataset spanning 13 Arab countries that covers both MSA and each country's dialect across 12 everyday-life topics and 54 subtopics. They further propose three evaluation tasks: multiple-choice cultural reasoning, translation between dialects and MSA, and dialect-steered text generation. The work offers a systematic assessment of LLMs' cultural comprehension in multi-dialectal conversational settings, addressing a gap in culturally sensitive Arabic evaluation. Experimental results show that the evaluated models perform worse on all three tasks in the dialectal setting than in MSA, underscoring their deficiencies in cultural and dialectal understanding.
Abstract
There is a significant gap in evaluating cultural reasoning in LLMs using conversational datasets that capture culturally rich and dialectal contexts. Most Arabic benchmarks focus on short text snippets in Modern Standard Arabic (MSA), overlooking the cultural nuances that naturally arise in dialogues. To address this gap, we introduce ArabCulture-Dialogue, a culturally grounded conversational dataset covering 13 Arabic-speaking countries, in both MSA and each country's respective dialect, spanning 12 daily-life topics and 54 fine-grained subtopics. We use the dataset to form three benchmarking tasks: (i) multiple-choice cultural reasoning, (ii) machine translation between MSA and dialects, and (iii) dialect-steering generation. Our experiments indicate that a performance gap between MSA and Arabic dialects persists: the models perform worse on all three tasks in the dialectal setup than in the MSA one.