🤖 AI Summary
Mainstream large language models (LLMs) show limited cultural sensitivity and uneven dialectal coverage in Arabic. Method: We introduce ArabInstruct, a community-driven instruction-tuning dataset built over one year and covering all 22 Arab countries, comprising instruction pairs (input, response) in both Modern Standard Arabic and national dialects across 20 culturally salient topics. Annotation was carried out collaboratively by 44 researchers from across the Arab world using standardized guidelines, enabling evaluation along two axes: cultural knowledge and dialectal capability. Contribution/Results: Evaluating several frontier LLMs reveals notable gaps: closed-source models perform well but imperfectly, smaller open-source models struggle more, and some countries (e.g., Egypt, the UAE) appear better represented than others (e.g., Iraq, Mauritania, Yemen). The dataset, guidelines, and code establish a benchmark resource and evaluation framework for advancing cultural inclusivity and linguistic diversity in Arabic LLMs.
📝 Abstract
As large language models (LLMs) become increasingly integrated into daily life, ensuring their cultural sensitivity and inclusivity is paramount. We introduce our dataset, the outcome of a year-long community-driven project covering all 22 Arab countries. The dataset includes instruction pairs (input and response) in both Modern Standard Arabic (MSA) and dialectal Arabic (DA), spanning 20 diverse topics. Built by a team of 44 researchers across the Arab world, all of whom are authors of this paper, our dataset offers a broad, inclusive perspective. We use our dataset to evaluate the cultural and dialectal capabilities of several frontier LLMs, revealing notable limitations. For instance, while closed-source LLMs generally exhibit strong performance, they are not without flaws, and smaller open-source models face greater challenges. Moreover, certain countries (e.g., Egypt, the UAE) appear better represented than others (e.g., Iraq, Mauritania, Yemen). Our annotation guidelines, code, and data are publicly available for reproducibility.
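To make the dataset's structure concrete, the sketch below shows one plausible way such instruction pairs could be laid out as JSONL records, plus a small helper for tallying per-country and per-variety coverage. The field names, labels, and file format here are illustrative assumptions for exposition, not the released schema.

```python
# Minimal sketch: a hypothetical record layout for MSA/DA instruction pairs
# and a coverage tally by (country, variety). Field names are assumptions.
import json
from collections import Counter

example_record = {
    "country": "Morocco",        # one of the 22 Arab countries
    "variety": "DA",             # "MSA" or "DA" (dialectal Arabic)
    "topic": "food",             # one of the 20 topics
    "instruction": "...",        # input prompt (Arabic text)
    "response": "...",           # reference response (Arabic text)
}

def coverage_by_country(path: str) -> Counter:
    """Count records per (country, variety) in a JSONL file of such records."""
    counts: Counter = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            counts[(rec["country"], rec["variety"])] += 1
    return counts
```

A tally like this is one simple way to inspect how evenly the 22 countries and two varieties are represented before using the data for evaluation.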