SSA-COMET: Do LLMs Outperform Learned Metrics in Evaluating MT for Under-Resourced African Languages?

📅 2025-06-05
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Addressing the challenge of evaluating machine translation (MT) quality for low-resource African languages, this paper introduces SSA-MTE, the first large-scale, human-annotated MT evaluation benchmark for Sub-Saharan African languages, covering 13 language pairs with over 63,000 annotations. Building on SSA-MTE, the authors propose two evaluation models: the reference-based SSA-COMET and the reference-free SSA-COMET-QE. The approach integrates contrastive learning with XLM-RoBERTa/DeBERTa backbones, multi-task joint training, and zero-/few-shot prompt engineering to suit extremely low-resource settings. Experiments show that SSA-COMET significantly outperforms AfriCOMET on Twi, Luo, and Yoruba, achieving performance comparable to Gemini 2.5 Pro; however, mainstream closed-source LLMs (e.g., GPT-4o, Claude, Gemini) do not yet consistently outperform lightweight learned metrics. All data and models are publicly released to support fairness, reproducibility, and progress in African-language NLP.
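To make the learned-metric idea concrete, here is a minimal sketch of the COMET-style architecture this family of metrics builds on: pool sentence embeddings from a multilingual encoder, combine them into a feature vector (concatenation plus elementwise products and absolute differences), and regress a quality score. Everything below is a toy stand-in, not the paper's actual model: `embed` fakes an XLM-RoBERTa sentence embedding with a hash-seeded random vector, `DIM` is tiny, and the head is a single untrained linear layer rather than a trained MLP.

```python
import math
import random
import zlib

DIM = 8  # toy embedding size; real encoders like XLM-RoBERTa use 768/1024

def embed(text, dim=DIM):
    """Stand-in for a pooled sentence embedding from a multilingual
    encoder. Seeded with a stable CRC32 hash so output is deterministic."""
    rng = random.Random(zlib.crc32(text.encode("utf-8")))
    v = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def comet_features(src, hyp, ref=None):
    """COMET-style feature vector: hypothesis and anchor embeddings,
    concatenated with their elementwise product and absolute difference.
    With ref=None the source is the anchor (the reference-free QE case)."""
    s, h = embed(src), embed(hyp)
    anchor = embed(ref) if ref is not None else s
    prod = [a * b for a, b in zip(h, anchor)]
    diff = [abs(a - b) for a, b in zip(h, anchor)]
    return h + anchor + prod + diff

def score(features, weights, bias=0.0):
    """Regression head squashed to (0, 1) with a sigmoid. Real metrics
    use a small MLP trained on human quality judgments."""
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Reference-based features (SSA-COMET-like) for a toy Twi/English pair.
feats = comet_features("Wo ho te sɛn?", "How are you?", ref="How are you doing?")
random.seed(0)
weights = [random.uniform(-0.1, 0.1) for _ in range(len(feats))]
print(round(score(feats, weights), 3))
```

Dropping the `ref` argument yields the reference-free (QE) feature vector of the same shape, which is what lets the same head design serve both SSA-COMET and SSA-COMET-QE.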

πŸ“ Abstract
Evaluating machine translation (MT) quality for under-resourced African languages remains a significant challenge, as existing metrics often suffer from limited language coverage and poor performance in low-resource settings. While recent efforts, such as AfriCOMET, have addressed some of the issues, they are still constrained by small evaluation sets, a lack of publicly available training data tailored to African languages, and inconsistent performance in extremely low-resource scenarios. In this work, we introduce SSA-MTE, a large-scale human-annotated MT evaluation (MTE) dataset covering 13 African language pairs from the News domain, with over 63,000 sentence-level annotations from a diverse set of MT systems. Based on this data, we develop SSA-COMET and SSA-COMET-QE, improved reference-based and reference-free evaluation metrics. We also benchmark prompting-based approaches using state-of-the-art LLMs like GPT-4o and Claude. Our experimental results show that SSA-COMET models significantly outperform AfriCOMET and are competitive with the strongest LLM (Gemini 2.5 Pro) evaluated in our study, particularly on low-resource languages such as Twi, Luo, and Yoruba. All resources are released under open licenses to support future research.
Problem

Research questions and friction points this paper is trying to address.

Evaluating MT quality for under-resourced African languages
Limited language coverage of existing MT evaluation metrics
Lack of large-scale African language MTE datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale human-annotated dataset for African languages
Improved reference-based and reference-free evaluation metrics
Benchmarked prompting-based approaches with state-of-the-art LLMs
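The prompting-based baseline can be sketched as follows: format a zero-shot instruction asking an LLM to rate a translation on a 0–100 scale, then parse the numeric score out of the free-text reply. The template and helper names below are hypothetical illustrations; the paper's actual prompts and models (GPT-4o, Claude, Gemini 2.5 Pro) are not reproduced here, and no API call is made.

```python
import re

# Hypothetical zero-shot template; the paper's actual prompts may differ.
PROMPT = """You are evaluating a machine translation.
Source ({src_lang}): {src}
Translation ({tgt_lang}): {hyp}
Rate the translation quality from 0 (unusable) to 100 (perfect).
Reply with the number only."""

def build_prompt(src, hyp, src_lang, tgt_lang):
    """Fill the template; the result would be sent to an LLM judge."""
    return PROMPT.format(src=src, hyp=hyp, src_lang=src_lang, tgt_lang=tgt_lang)

def parse_score(reply):
    """Extract the first number in the model's reply and clamp it to
    [0, 100]; return None when the reply contains no number."""
    m = re.search(r"-?\d+(?:\.\d+)?", reply)
    return None if m is None else max(0.0, min(100.0, float(m.group())))

prompt = build_prompt("Bawo ni?", "How are you?", "Yoruba", "English")
print(parse_score("Quality: 85"))
```

Clamping and a `None` fallback matter in practice: LLM judges sometimes reply with prose around the number or with no number at all, and unparsed replies should be retried or excluded rather than silently scored as zero.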