Palm: A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs

📅 2025-02-28
🤖 AI Summary
Problem: Mainstream large language models (LLMs) show limited cultural sensitivity and uneven dialect coverage in Arabic. Method: The paper introduces Palm, a community-driven instruction dataset built over one year and covering all 22 Arab countries. It contains instructions (input, response pairs) in both Modern Standard Arabic (MSA) and dialectal Arabic (DA) across 20 topics, written by 44 researchers from across the Arab world following shared annotation guidelines. Contribution/Results: Evaluating several frontier LLMs on Palm reveals notable limitations in cultural and dialectal capability: closed-source models are generally strong but not without flaws, smaller open-source models struggle more, and some countries (e.g., Egypt, the UAE) appear better represented than others (e.g., Iraq, Mauritania, Yemen). The annotation guidelines, code, and data are publicly available.

📝 Abstract
As large language models (LLMs) become increasingly integrated into daily life, ensuring their cultural sensitivity and inclusivity is paramount. We introduce Palm, a year-long community-driven project covering all 22 Arab countries. The dataset includes instructions (input, response pairs) in both Modern Standard Arabic (MSA) and dialectal Arabic (DA), spanning 20 diverse topics. Built by a team of 44 researchers across the Arab world, all of whom are authors of this paper, our dataset offers a broad, inclusive perspective. We use our dataset to evaluate the cultural and dialectal capabilities of several frontier LLMs, revealing notable limitations. For instance, while closed-source LLMs generally exhibit strong performance, they are not without flaws, and smaller open-source models face greater challenges. Moreover, certain countries (e.g., Egypt, the UAE) appear better represented than others (e.g., Iraq, Mauritania, Yemen). Our annotation guidelines, code, and data for reproducibility are publicly available.
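
To make the structure described in the abstract concrete, the following is a minimal sketch of how one instruction pair might be represented and how per-record evaluation scores could be averaged by country to surface representation gaps. The class name, field names, scoring scale, and toy values are all assumptions made for illustration; they are not the released schema or the authors' evaluation code.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class InstructionRecord:
    """One instruction pair. All field names are hypothetical."""
    country: str      # one of the 22 Arab countries
    variety: str      # "MSA" or "DA"
    topic: str        # one of the 20 topics
    instruction: str  # the input
    response: str     # the reference response


def mean_score_by_country(records, scores):
    """Average per-record scores (any numeric judge score) per country."""
    if len(records) != len(scores):
        raise ValueError("records and scores must have the same length")
    buckets = defaultdict(list)
    for record, score in zip(records, scores):
        buckets[record.country].append(score)
    return {country: mean(values) for country, values in buckets.items()}


# Toy placeholder data, not taken from the dataset.
records = [
    InstructionRecord("Egypt", "DA", "food", "<instruction>", "<response>"),
    InstructionRecord("Egypt", "MSA", "food", "<instruction>", "<response>"),
    InstructionRecord("Yemen", "DA", "food", "<instruction>", "<response>"),
]
scores = [8.0, 6.0, 4.0]
print(mean_score_by_country(records, scores))
```

Grouping by `country` (or by `variety` or `topic`, by swapping the key) is one simple way to compare how a model fares across the dimensions the dataset covers.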
Problem

Research questions and friction points this paper is trying to address.

Addressing cultural sensitivity in Arabic language models.
Evaluating dialectal diversity in large language models.
Identifying representation gaps across Arab countries in datasets.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Community-driven dataset for Arabic LLMs
Includes Modern Standard and dialectal Arabic
Evaluates cultural and dialectal LLM capabilities
Authors

Fakhraddin Alwajih: Postdoctoral Fellow Researcher @ UBC. Interests: Artificial Intelligence, Machine Learning, Natural Language Processing
Abdellah El Mekki: MBZUAI
Samar Mohamed Magdy: MBZUAI
Abdelrahim A. Elmadany: The University of British Columbia
Omer Nacar: Prince Sultan University
El Moatez Billah Nagoudi: The University of British Columbia. Interests: Natural Language Processing, Machine Learning, Deep Learning, Arabic Language
Reem Abdel-Salam: MSc student at Faculty of Engineering, Computer Department, Cairo University. Interests: Deep Learning, Computer Vision, Image Processing
Hanin Atwany: MBZUAI
Youssef Nafea: Masters Student at MBZUAI. Interests: Deep Learning, LLMs, Natural Language Processing, Speech Processing
Abdulfattah Mohammed Yahya: Cairo University
Rahaf Alhamouri: JUST
Hamzah A. Alsayadi
Hiba Zayed: Birzeit University
Sara Shatnawi: MBZUAI
Serry Sibaee: Research Engineer. Interests: Arabic Natural Language Processing, NLP
Yasir Ech-Chammakhy: UM6P
Walid Al-Dhabyani: Cairo University
Marwa Mohamed Ali: Ain Shams University
Imen Jarraya: Doctor. Interests: Computer Science, Renewable Energy, Electric Vehicle, ROS1 and ROS
Ahmed Oumar El-Shangiti: Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) | Riken AIP. Interests: Machine Learning, Natural Language Processing, Interpretability, Generative AI, Machine Translation
Aisha Alraeesi: MBZUAI
Mohammed Anwar Al-Ghrawi: Damascus University
Abdulrahman S. Al-Batati: Robotics and Internet-of-Things Laboratory, Prince Sultan University, Riyadh, Saudi Arabia. Interests: Robotics, 3D Printing
Elgizouli Mohamed
Noha Taha Elgindi: Menoufiya University
Muhammed Saeed: MBZUAI
Houdaifa Atou: UM6P
Issam Ait Yahia: UM6P
Abdelhak Bouayad: PhD candidate. Interests: Machine Learning, Deep Learning, Privacy, Federated Learning
Mohammed Machrouh: UM6P
Amal Makouar: UM6P
Dania Alkawi: Prince Sultan University
Mukhtar Mohamed: MBZUAI
Safaa Taher Abdelfadil: MBZUAI
Amine Ziad Ounnoughene: Sorbonne Université. Interests: Machine Learning, Spiking Neural Networks
Rouabhia Anfel: Full Sail University
Rwaa Assi: Birzeit University
Ahmed Sorkatti: University of Khartoum
Mohamedou Cheikh Tourad: University of Nouakchott
Anis Koubaa: Alfaisal University. Interests: Large Language Models, Generative AI, Unmanned Aerial Vehicles, Robot Operating System
Ismail Berrada: UM6P
Mustafa Jarrar: Professor, Hamad Bin Khalifa University, Qatar - Birzeit University, Palestine. Interests: Arabic Natural Language Processing, Social Computing, Ontology Engineering, Knowledge Graphs
Shady Shehata: University of Waterloo. Interests: Artificial Intelligence, Natural Language Processing
Muhammad Abdul-Mageed: The University of British Columbia. Interests: Natural Language Processing, Deep Learning