🤖 AI Summary
Addressing the challenge of evaluating machine translation (MT) quality for low-resource African languages, this paper introduces SSA-MTE, the first large-scale, human-annotated MT evaluation benchmark for Sub-Saharan Africa, covering 13 language pairs with over 63,000 annotations. Building on SSA-MTE, the authors propose two evaluation models: reference-based SSA-COMET and reference-free SSA-COMET-QE. The approach combines contrastive learning over XLM-RoBERTa/DeBERTa backbones, multi-task joint training, and zero-/few-shot prompting to suit extremely low-resource settings. Experiments show that SSA-COMET significantly outperforms AfriCOMET on Twi, Luo, and Yoruba, reaching performance comparable to Gemini 2.5 Pro; notably, mainstream closed-source LLMs (e.g., GPT-4o, Claude, Gemini) do not yet consistently surpass lightweight learned metrics. All data and models are publicly released to advance fairness, reproducibility, and progress in African-language NLP.
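The reference-based vs. reference-free distinction can be illustrated by the inputs each metric consumes. A minimal sketch, using the COMET-style convention of one dict per segment; the field names and example strings are illustrative assumptions, not taken from the paper:

```python
# Illustrative input formats for a reference-based metric (SSA-COMET)
# versus a reference-free quality-estimation metric (SSA-COMET-QE).
# Field names follow the common COMET convention (an assumption here).

ref_based_input = {
    "src": "Ẹ káàárọ̀",       # source sentence (Yoruba)
    "mt":  "Good morning",     # MT system output
    "ref": "Good morning",     # human reference translation
}

# A QE metric scores the same segment without access to the reference:
ref_free_input = {k: v for k, v in ref_based_input.items() if k != "ref"}

print(sorted(ref_free_input))  # → ['mt', 'src']
```

Dropping the `ref` field is what makes QE metrics usable at inference time, when no human reference exists.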
📄 Abstract
Evaluating machine translation (MT) quality for under-resourced African languages remains a significant challenge, as existing metrics often suffer from limited language coverage and poor performance in low-resource settings. While recent efforts, such as AfriCOMET, have addressed some of these issues, they are still constrained by small evaluation sets, a lack of publicly available training data tailored to African languages, and inconsistent performance in extremely low-resource scenarios. In this work, we introduce SSA-MTE, a large-scale human-annotated MT evaluation (MTE) dataset covering 13 African language pairs from the news domain, with over 63,000 sentence-level annotations of outputs from a diverse set of MT systems. Based on this dataset, we develop SSA-COMET and SSA-COMET-QE, improved reference-based and reference-free evaluation metrics. We also benchmark prompting-based approaches using state-of-the-art LLMs such as GPT-4o and Claude. Our experimental results show that SSA-COMET models significantly outperform AfriCOMET and are competitive with the strongest LLM (Gemini 2.5 Pro) evaluated in our study, particularly on low-resource languages such as Twi, Luo, and Yoruba. All resources are released under open licenses to support future research.
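The comparison between metrics reported above rests on standard segment-level meta-evaluation: each metric's per-segment scores are correlated with the human annotations, and the metric with the higher correlation "outperforms" the other. A minimal stdlib-only sketch of Spearman correlation, with purely illustrative scores (not data from the paper):

```python
# Segment-level meta-evaluation sketch: correlate a metric's scores
# with human judgments. All numbers below are made up for illustration.

def rankdata(values):
    """Assign 1-based ranks; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(rankdata(x), rankdata(y))

# Hypothetical per-segment quality scores on five translations.
human    = [0.90, 0.20, 0.60, 0.40, 0.80]
metric_a = [0.85, 0.30, 0.55, 0.35, 0.90]  # tracks the human ranking closely
metric_b = [0.50, 0.60, 0.40, 0.55, 0.45]  # weak, partly inverted signal

print(f"metric A vs human: {spearman(human, metric_a):.2f}")  # → 0.90
print(f"metric B vs human: {spearman(human, metric_b):.2f}")  # → -0.60
```

In practice, released COMET-style checkpoints are typically meta-evaluated this way per language pair, which is how language-specific gains (e.g., on Twi or Luo) are quantified.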