🤖 AI Summary
Arabic is diglossic: Modern Standard Arabic (MSA) coexists with regional dialects, and the low-resource Syrian (Shami) dialect poses significant challenges for machine translation. To address this, we propose a dual-model architecture for bidirectional MSA–Shami translation, with one model per direction, built upon AraT5v2-base-1024, fine-tuned on the Nabra dataset, and evaluated on the MADAR corpus. Our key contribution is the first end-to-end, high-fidelity, native-like bidirectional translation system for Shami↔MSA, bridging a critical gap in low-resource dialectal MT. Evaluation with GPT-4.1 as an automatic judge yields an average score of 4.01/5.0 for MSA→Shami translation, substantially outperforming baselines, and indicates that the model is effective at grammatical adaptation, pragmatic naturalness, and preservation of dialect-specific features.
📝 Abstract
The linguistic landscape of the Arab world is characterized by a significant gap between Modern Standard Arabic (MSA), the language of formal communication, and the diverse regional dialects used in everyday life. This diglossia presents a formidable challenge for natural language processing, particularly machine translation. This paper introduces SHAMI-MT, a bidirectional machine translation system specifically engineered to bridge the communication gap between MSA and the Syrian dialect. We present two specialized models, one for MSA-to-Shami and another for Shami-to-MSA translation, both built upon the state-of-the-art AraT5v2-base-1024 architecture. The models were fine-tuned on the Nabra dataset and evaluated on unseen data from the MADAR corpus. Our MSA-to-Shami model achieved an average quality score of 4.01 out of 5.0 when judged by OpenAI's GPT-4.1 model, demonstrating its ability to produce translations that are not only accurate but also dialectally authentic. This work provides a high-fidelity tool for a previously underserved language pair, advancing the field of dialectal Arabic translation and offering applications in content localization, cultural heritage, and intercultural communication.