🤖 AI Summary
Large language models (LLMs) exhibit significant cultural bias in understanding Arabic and Islamic cultures, stemming from the Western-centric composition of their pretraining data—particularly pronounced on underrepresented topics. To address this, the PalmX 2025 shared task introduces the first dedicated evaluation framework for Arabic and Islamic cultural knowledge, comprising two subtasks: General Arabic Culture and General Islamic Culture. The benchmark consists of multiple-choice questions in Modern Standard Arabic, drawn from 22 Arab countries and covering traditions, cuisine, history, religious practices, and linguistic expressions. Methodologically, participants employed task-specific fine-tuning, parameter-efficient adaptation (e.g., LoRA), and domain-aware data augmentation. The top-performing systems achieved 72.15% and 84.22% accuracy on the Arabic and Islamic subtasks, respectively. These results validate culturally grounded modeling as a viable approach and highlight critical challenges—and actionable directions—in cross-cultural knowledge representation.
📝 Abstract
Large Language Models (LLMs) inherently reflect the vast data distributions they encounter during their pre-training phase. As this data is predominantly sourced from the web, it is likely to be skewed towards high-resource languages and cultures, such as those of the West. Consequently, LLMs often exhibit a diminished understanding of certain communities, a gap that is particularly evident in their knowledge of Arabic and Islamic cultures. This issue becomes even more pronounced for increasingly under-represented topics. To address this critical challenge, we introduce PalmX 2025, the first shared task designed to benchmark the cultural competence of LLMs in these specific domains. The task is composed of two subtasks featuring multiple-choice questions (MCQs) in Modern Standard Arabic (MSA): General Arabic Culture and General Islamic Culture. These subtasks cover a wide range of topics, including traditions, food, history, religious practices, and language expressions from across 22 Arab countries. The initiative drew considerable interest, with 26 teams registering for Subtask 1 and 19 for Subtask 2, culminating in nine and six valid submissions, respectively. Our findings reveal that task-specific fine-tuning substantially boosts performance over baseline models. The top-performing systems achieved an accuracy of 72.15% on cultural questions and 84.22% on Islamic knowledge. Parameter-efficient fine-tuning emerged as the predominant and most effective approach among participants, while the utility of data augmentation was found to be domain-dependent.
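Since both subtasks report plain multiple-choice accuracy (e.g., 72.15% and 84.22% for the winning systems), a minimal sketch of that metric may help clarify how submissions are scored. The function name and toy labels below are illustrative, not taken from the official evaluation scripts:

```python
def mcq_accuracy(gold: list[str], predictions: list[str]) -> float:
    """Fraction of MCQs where the predicted choice label matches the gold label."""
    if len(gold) != len(predictions):
        raise ValueError("gold and predictions must have the same length")
    correct = sum(g == p for g, p in zip(gold, predictions))
    return correct / len(gold)

# Toy example with choice labels A-D (hypothetical data):
gold = ["A", "C", "B", "D"]
preds = ["A", "C", "D", "D"]
print(f"{mcq_accuracy(gold, preds):.2%}")  # → 75.00%
```

In practice a submission would map each model answer back to its choice label before scoring, so accuracy stays comparable across systems regardless of how each one prompts the model.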