DialectalArabicMMLU: Benchmarking Dialectal Capabilities in Arabic and Multilingual Language Models

📅 2025-10-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Arabic evaluation benchmarks focus predominantly on Modern Standard Arabic (MSA), neglecting widely spoken colloquial dialects. Method: We introduce DialectalArabicMMLU—the first unified, human-annotated benchmark for Arabic dialects—built upon the MMLU-Redux framework. It comprises 15K high-quality, expert-translated and localized question-answer pairs across 32 domains, covering five major dialects (Syrian, Egyptian, Emirati, Saudi, and Moroccan), with parallel English and MSA references. Contribution/Results: DialectalArabicMMLU is the first benchmark to jointly support task-oriented evaluation and linguistic analysis. Evaluation across 19 open-source Arabic and multilingual LMs reveals consistently poor dialect comprehension and reasoning, with substantial performance variance across dialects—highlighting a critical bottleneck in dialectal generalization capability.

📝 Abstract
We present DialectalArabicMMLU, a new benchmark for evaluating the performance of large language models (LLMs) across Arabic dialects. While recently developed Arabic and multilingual benchmarks have advanced LLM evaluation for Modern Standard Arabic (MSA), dialectal varieties remain underrepresented despite their prevalence in everyday communication. DialectalArabicMMLU extends the MMLU-Redux framework through manual translation and adaptation of 3K multiple-choice question-answer pairs into five major dialects (Syrian, Egyptian, Emirati, Saudi, and Moroccan), yielding a total of 15K QA pairs across 32 academic and professional domains (22K QA pairs when also including English and MSA). The benchmark enables systematic assessment of LLM reasoning and comprehension beyond MSA, supporting both task-based and linguistic analysis. We evaluate 19 open-weight Arabic and multilingual LLMs (1B-13B parameters) and report substantial performance variation across dialects, revealing persistent gaps in dialectal generalization. DialectalArabicMMLU provides the first unified, human-curated resource for measuring dialectal understanding in Arabic, thus promoting more inclusive evaluation and future model development.
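The headline result, per-dialect performance variation, comes down to scoring multiple-choice predictions separately for each dialect. A minimal sketch of that bookkeeping, assuming a hypothetical example format (a `dialect` label, a list of `choices`, and a gold `answer` index) and a caller-supplied prediction function; this is not the paper's actual evaluation harness:

```python
from collections import defaultdict

def per_dialect_accuracy(examples, predict):
    """Compute multiple-choice accuracy grouped by dialect.

    `examples`: iterable of dicts with keys "dialect", "question",
    "choices", and the gold "answer" index (hypothetical schema).
    `predict`: callable mapping one example to a predicted choice index.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        total[ex["dialect"]] += 1
        if predict(ex) == ex["answer"]:
            correct[ex["dialect"]] += 1
    # Per-dialect accuracy; comparing these values across the five
    # dialects surfaces the generalization gaps the paper reports.
    return {d: correct[d] / total[d] for d in total}
```

Grouping by dialect rather than pooling all 15K pairs is the point: a model can look adequate on aggregate accuracy while failing badly on, say, Moroccan relative to Egyptian.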
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM performance across five major Arabic dialects
Addressing underrepresentation of dialectal varieties in Arabic benchmarks
Measuring reasoning gaps in dialectal generalization for Arabic LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Manually translated and localized MMLU-Redux questions into five Arabic dialects
Created 15K dialectal QA pairs across 32 domains
Established first unified benchmark for Arabic dialect evaluation