🤖 AI Summary
This study addresses the persistent difficulty that machine translation systems have in handling the diverse dialectal variants of Arabic, which limits effective communication for millions of native speakers. To bridge this gap, the authors introduce the Alexandria dataset, a large-scale, community-driven resource comprising 107,000 human-translated utterances from multi-turn dialogues across 13 Arabic-speaking countries and 11 high-impact domains. Alexandria uniquely incorporates fine-grained city-level dialect annotations and explicit speaker–listener gender configurations, moving beyond conventional coarse-grained regional labels. The dataset is validated through a dual-track evaluation framework that combines automated metrics with human assessment. Experimental results reveal significant shortcomings in how current large language models translate dialectal Arabic, establishing Alexandria as a high-quality benchmark for future research on modeling, training, and evaluating systems in this linguistically complex setting.
📝 Abstract
Arabic is a highly diglossic language where most daily communication occurs in regional dialects rather than Modern Standard Arabic. Despite this, machine translation (MT) systems often generalize poorly to dialectal input, limiting their utility for millions of speakers. We introduce Alexandria, a large-scale, community-driven, human-translated dataset designed to bridge this gap. Alexandria covers 13 Arab countries and 11 high-impact domains, including health, education, and agriculture. Unlike previous resources, Alexandria provides unprecedented granularity by associating contributions with city-of-origin metadata, capturing authentic local varieties beyond coarse regional labels. The dataset consists of multi-turn conversational scenarios annotated with speaker-addressee gender configurations, enabling the study of gender-conditioned variation in dialectal use. Comprising 107K total samples, Alexandria serves as both a training resource and a rigorous benchmark for evaluating MT and Large Language Models (LLMs). Our automatic and human evaluation of Arabic-aware LLMs benchmarks current capabilities in translating across diverse Arabic dialects and sub-dialects, while exposing significant, persistent challenges.
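To make the dataset description concrete, the sketch below shows what a single Alexandria sample could look like as a flat record, combining the attributes the abstract names: city-of-origin metadata, speaker-addressee gender configuration, domain, and multi-turn dialogue context. All field names and values here are illustrative assumptions inferred from the abstract, not the dataset's actual released schema, and the translation direction shown in the comments is likewise not specified in the abstract.

```python
# Hypothetical sketch of one Alexandria sample, inferred from the abstract.
# Field names, value conventions, and translation direction are assumptions
# for illustration; the released dataset may use a different schema.
from dataclasses import dataclass

@dataclass
class AlexandriaSample:
    dialogue_id: str           # the multi-turn conversation this turn belongs to
    turn_index: int            # position of the utterance within the dialogue
    country: str               # one of the 13 covered Arab countries
    city: str                  # fine-grained city-of-origin metadata
    domain: str                # one of the 11 high-impact domains (e.g. health)
    speaker_gender: str        # speaker side of the gender configuration
    addressee_gender: str      # addressee side of the gender configuration
    source_text: str           # source-side utterance
    dialect_translation: str   # human translation in the local dialect variety

# Example record with placeholder text (actual utterances elided; the
# abstract does not quote any sample content).
sample = AlexandriaSample(
    dialogue_id="dlg-0001",
    turn_index=0,
    country="Egypt",
    city="Alexandria",
    domain="health",
    speaker_gender="female",
    addressee_gender="male",
    source_text="...",
    dialect_translation="...",
)
```

Under a schema like this, the city field supports the fine-grained sub-dialect analysis the paper emphasizes, while the paired gender fields make it possible to condition on or filter by speaker-addressee configuration when studying gender-driven variation.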