🤖 AI Summary
Mainstream large language models (LLMs) show limited cultural sensitivity and uneven dialectal coverage in Arabic. Method: We introduce ArabInstruct, a community-driven instruction-tuning dataset built over one year and covering all 22 Arab countries, comprising instruction pairs (input, response) in both Modern Standard Arabic and national dialects across 20 culturally salient topics. Annotation was carried out collaboratively by 44 researchers from across the Arab world using standardized guidelines, enabling evaluation along two axes: cultural knowledge and dialectal capability. Contribution/Results: Evaluating several frontier LLMs reveals notable gaps: closed-source models perform well but imperfectly, smaller open-source models struggle more, and some countries (e.g., Egypt, the UAE) appear better represented than others (e.g., Iraq, Mauritania, Yemen). The dataset, guidelines, and code establish a benchmark resource and evaluation framework for advancing cultural inclusivity and linguistic diversity in Arabic LLMs.
📝 Abstract
As large language models (LLMs) become increasingly integrated into daily life, ensuring their cultural sensitivity and inclusivity is paramount. We introduce our dataset, the outcome of a year-long community-driven project covering all 22 Arab countries. The dataset includes instruction pairs (input and response) in both Modern Standard Arabic (MSA) and dialectal Arabic (DA), spanning 20 diverse topics. Built by a team of 44 researchers across the Arab world, all of whom are authors of this paper, our dataset offers a broad, inclusive perspective. We use our dataset to evaluate the cultural and dialectal capabilities of several frontier LLMs, revealing notable limitations. For instance, while closed-source LLMs generally exhibit strong performance, they are not without flaws, and smaller open-source models face greater challenges. Moreover, certain countries (e.g., Egypt, the UAE) appear better represented than others (e.g., Iraq, Mauritania, Yemen). Our annotation guidelines, code, and data are publicly available for reproducibility.
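To make the dataset's structure concrete, the sketch below shows one plausible way such instruction pairs could be laid out as JSONL records, plus a small helper for tallying per-country and per-variety coverage. The field names, labels, and file format here are illustrative assumptions for exposition, not the released schema.

```python
# Minimal sketch: a hypothetical record layout for MSA/DA instruction pairs
# and a coverage tally by (country, variety). Field names are assumptions.
import json
from collections import Counter

example_record = {
    "country": "Morocco",        # one of the 22 Arab countries
    "variety": "DA",             # "MSA" or "DA" (dialectal Arabic)
    "topic": "food",             # one of the 20 topics
    "instruction": "...",        # input prompt (Arabic text)
    "response": "...",           # reference response (Arabic text)
}

def coverage_by_country(path: str) -> Counter:
    """Count records per (country, variety) in a JSONL file of such records."""
    counts: Counter = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            counts[(rec["country"], rec["variety"])] += 1
    return counts
```

A tally like this is one simple way to inspect how evenly the 22 countries and two varieties are represented before using the data for evaluation.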