🤖 AI Summary
Machine translation from Dialectal Arabic (DA) to Modern Standard Arabic (MSA) suffers from lexical, syntactic, and semantic divergences, and existing automatic metrics and generic human evaluation approaches fail to detect dialect-specific errors. To address this, we propose the first human-centered post-editing evaluation framework tailored for DA→MSA translation. The framework introduces a five-category error taxonomy and a decision-tree-based structured annotation protocol to systematically model dialectal term mapping and semantic fidelity, and it integrates human post-editing analysis, the development of an error-attribution taxonomy, and cross-system comparative evaluation. Experimental results, based on rigorous post-editing of outputs from Jais, GPT-3.5, and NLLB-200, demonstrate statistically significant performance differences among these systems for the first time. Crucially, the evaluation reveals that inaccurate translation of dialectal terms and poor semantic consistency constitute the primary bottlenecks in current DA→MSA MT systems.
📝 Abstract
Dialectal Arabic to Modern Standard Arabic (DA-MSA) translation is a challenging Machine Translation (MT) task due to significant lexical, syntactic, and semantic divergences between Arabic dialects and MSA. Existing automatic evaluation metrics and general-purpose human evaluation frameworks struggle to capture dialect-specific MT errors, hindering progress in translation assessment. This paper introduces Ara-HOPE, a human-centric post-editing evaluation framework designed to address these challenges systematically. The framework comprises a five-category error taxonomy and a decision-tree annotation protocol. Through a comparative evaluation of three MT systems (the Arabic-centric Jais, the general-purpose GPT-3.5, and the NLLB-200 baseline), Ara-HOPE surfaces systematic performance differences among them. The results show that dialect-specific terminology and semantic preservation remain the most persistent challenges in DA-MSA translation. Ara-HOPE thus offers a new basis for evaluating Dialectal Arabic MT quality and provides actionable guidance for improving dialect-aware MT systems.