AI Summary
This study addresses the limitations of conventional BLEU scores in evaluating machine translation quality for extremely low-resource languages such as Magahi, Bhojpuri, and Chhattisgarhi. Through systematic empirical analysis of outputs from both neural machine translation (NMT) systems and large language models (LLMs), the work investigates the sensitivity of the character-level metric ChrF++ and the n-gram-based metric BLEU to common pathologies, including hallucination, repetition, source copying, and diacritic variation. The findings reveal that although BLEU tends to yield lower scores in low-resource settings, its capacity to capture lexical precision effectively complements the strengths of ChrF++. Jointly leveraging both metrics substantially enhances the comprehensiveness and interpretability of translation quality assessment, offering a practical evaluation paradigm for low-resource scenarios.
Abstract
Evaluating machine translation (MT) quality in extremely low-resource language (ELRL) scenarios poses unique challenges, as widely used metrics such as BLEU, effective in high-resource settings, often misrepresent quality in data-scarce contexts. This work presents a comparative analysis of BLEU, an n-gram-based metric, and ChrF++, a character-based metric, for MT evaluation in ELRL settings. We examine how each metric responds to translation artifacts, including hallucinations, repetition, source-text copying, and diacritic (matra) variations, across three ELRLs: Magahi, Bhojpuri, and Chhattisgarhi, with a focus on outputs from large language models (LLMs) and neural MT (NMT) systems. While recent work often relies solely on ChrF++, our findings show that BLEU, despite its lower absolute scores, provides complementary lexical-precision insights that improve interpretability.
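The complementarity between character-level and word-level matching can be illustrated with a toy sketch. Note this is a simplified single-order version of each metric, not the actual BLEU or ChrF++ implementations (real ChrF++ averages character orders 1–6 plus word 1-/2-grams, and BLEU combines orders 1–4 with a brevity penalty): a degenerate, repetitive hypothesis still earns partial character n-gram credit, while clipped word n-gram precision drops to zero.

```python
from collections import Counter

def ngrams(seq, n):
    """Count all n-grams in a sequence (a string for characters, a list for words)."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))

def clipped_overlap(hyp_counts, ref_counts):
    """Credit each hypothesis n-gram at most as often as it occurs in the reference."""
    return sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())

def char_f(hyp, ref, n=3, beta=2.0):
    """ChrF-style character n-gram F-beta score (single order, spaces removed)."""
    h, r = ngrams(hyp.replace(" ", ""), n), ngrams(ref.replace(" ", ""), n)
    if not h or not r:
        return 0.0
    m = clipped_overlap(h, r)
    prec, rec = m / sum(h.values()), m / sum(r.values())
    if prec + rec == 0:
        return 0.0
    return (1 + beta**2) * prec * rec / (beta**2 * prec + rec)

def word_precision(hyp, ref, n=2):
    """BLEU-style clipped word n-gram precision (single order, no brevity penalty)."""
    h, r = ngrams(hyp.split(), n), ngrams(ref.split(), n)
    return clipped_overlap(h, r) / max(sum(h.values()), 1)

ref = "the cat sat on the mat"
repetitive = "the the the the the the"
print(char_f(repetitive, ref))         # repetition still earns character-level credit (> 0)
print(word_precision(repetitive, ref)) # 0.0: no matching word bigrams at all
```

This mirrors the pathology sensitivity studied above: the character metric alone can mask degenerate outputs such as repetition loops, whereas word-level clipped precision exposes them, which is why reporting both scores improves interpretability.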