Cultural Benchmarking of LLMs in Standard and Dialectal Arabic Dialogues

Date: 2026-04-30

AI Summary
This study addresses the limited scope of existing evaluations of large language models (LLMs) in Arabic, which focus predominantly on short texts in Modern Standard Arabic (MSA) and neglect the cultural nuances and dialectal diversity of everyday conversation. To bridge this gap, the authors introduce ArabCulture-Dialogue, a dialogue dataset spanning 13 Arabic-speaking countries that covers both MSA and each country's dialect across 12 everyday-life topics and 54 subtopics. They define three evaluation tasks on it: multiple-choice cultural reasoning, dialect–MSA translation, and dialect-steered text generation. The work assesses LLMs' cultural comprehension in multi-dialectal conversational settings, targeting a gap in culturally sensitive Arabic language evaluation. Experimental results show that the evaluated models perform worse on all three tasks in the dialectal setting than in MSA, indicating that the gap between MSA and dialect performance persists.
Abstract
There is a significant gap in evaluating cultural reasoning in LLMs using conversational datasets that capture culturally rich and dialectal contexts. Most Arabic benchmarks focus on short text snippets in Modern Standard Arabic (MSA), overlooking the cultural nuances that naturally arise in dialogues. To address this gap, we introduce ArabCulture-Dialogue, a culturally grounded conversational dataset covering 13 Arabic-speaking countries, in both MSA and each country's respective dialect, spanning 12 daily-life topics and 54 fine-grained subtopics. We utilize the dataset to form three benchmarking tasks: (i) multiple-choice cultural reasoning, (ii) machine translation between MSA and dialects, and (iii) dialect-steering generation. Our experiments indicate that the performance gap between MSA and Arabic dialects still exists, whereby the models perform worse on all three tasks in the dialectal setup, compared to the MSA one.
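The MSA-versus-dialect comparison for the multiple-choice task can be pictured with a small sketch. This is only an illustration under assumptions: the `McqResult` schema, the field names, the `"msa"`/`"dialect"` labels, and the toy predictions are all hypothetical and are not taken from the paper or the dataset.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class McqResult:
    """One scored multiple-choice item (hypothetical schema, not the dataset's)."""
    country: str     # placeholder country identifier
    variety: str     # "msa" or "dialect" (assumed labels)
    gold: str        # correct option label
    predicted: str   # option label chosen by the model


def accuracy_by_variety(results):
    """Return {variety: accuracy} over the scored items."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        total[r.variety] += 1
        correct[r.variety] += int(r.predicted == r.gold)
    return {v: correct[v] / total[v] for v in total}


def msa_dialect_gap(results):
    """MSA accuracy minus dialect accuracy; positive means MSA scored higher."""
    acc = accuracy_by_variety(results)
    return acc["msa"] - acc["dialect"]


# Made-up predictions, used only to exercise the functions above.
toy = [
    McqResult("country_a", "msa", "A", "A"),
    McqResult("country_a", "dialect", "A", "B"),
    McqResult("country_b", "msa", "C", "C"),
    McqResult("country_b", "dialect", "C", "C"),
]

print(accuracy_by_variety(toy))
print(msa_dialect_gap(toy))
```

Grouping by `country` as well as `variety` would give a per-country view of the same gap; the paper's actual metrics and scoring protocol may differ from this sketch.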
Problem

Research questions and friction points this paper is trying to address.

cultural reasoning
Arabic dialects
conversational dataset
Modern Standard Arabic
LLM benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

cultural reasoning
dialectal Arabic
conversational dataset
benchmarking
machine translation
Muhammad Dehan Al Kautsar
Mohamed bin Zayed University of Artificial Intelligence
Natural Language Processing, Multilinguality, Human-Centered NLP
Saeed Almheiri
Mohamed bin Zayed University of Artificial Intelligence
Momina Ahsan
Mohamed bin Zayed University of Artificial Intelligence
Bilal Elbouardi
Mohamed bin Zayed University of Artificial Intelligence
Younes Samih
IBM Research AI, IBM
LLMs, NLP, Arabic NLP
Sarfraz Ahmad
Mohamed bin Zayed University of Artificial Intelligence
Amr Keleg
Mohamed bin Zayed University of Artificial Intelligence
Omar El Herraoui
Mohamed bin Zayed University of Artificial Intelligence
Kareem Elzeky
Mohamed bin Zayed University of Artificial Intelligence
Abed Alhakim Freihat
University of Trento
Natural language processing, Lexical semantics, Ontology, WordNet
Mohamed Anwar
Mohamed bin Zayed University of Artificial Intelligence
Zhuohan Xie
MBZUAI
Financial AI, Reasoning, Natural Language Processing, Computational Linguistics, Deep Learning
Junhong Liang
Mohamed bin Zayed University of Artificial Intelligence
Mohammad Rustom Al Nasar
American University in the Emirates
Preslav Nakov
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Computational Linguistics, Large Language Models, Fact-checking, Fake News
Fajri Koto
Assistant Professor (tenure-track), MBZUAI
Computational Linguistics, Natural Language Processing, Multilingual NLP, Human-centered NLP