AI Summary
This work addresses the challenge of incomparable evaluation results in extremely low-resource machine translation, where performance gains are often confounded by dataset-specific biases rather than genuine model improvements. To this end, the authors propose FRED, a novel difficulty metric framework that quantifies intrinsic dataset complexity along four dimensions: token fertility ratio, retrieval proxy, pre-training exposure, and corpus diversity. The framework reveals that train-test overlap and pre-training coverage are primary drivers of performance disparities, while also identifying high fertility as a fundamental bottleneck for translating non-Latin and extinct languages. By providing transparent, standardized difficulty indices, FRED substantially enhances the interpretability and comparability of evaluations in extremely low-resource MT research.
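Of the four dimensions, token fertility is the most directly computable: it measures how many subword tokens a pre-trained tokenizer needs per word of the target language. The sketch below is a minimal illustration, not the paper's exact definition of F; the choice of `xlm-roberta-base` and whitespace word segmentation are arbitrary assumptions.

```python
from transformers import AutoTokenizer


def fertility_ratio(sentences, tokenizer_name="xlm-roberta-base"):
    """Average subword tokens per whitespace-delimited word.

    A ratio near 1 suggests the tokenizer covers the language well;
    a high ratio means words are shattered into many pieces, the
    transfer bottleneck the paper associates with extinct and
    non-Latin-script languages.
    """
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    total_subwords = total_words = 0
    for sent in sentences:
        total_words += len(sent.split())
        total_subwords += len(tokenizer.tokenize(sent))
    return total_subwords / max(total_words, 1)
```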
Abstract
The landscape of extremely low-resource (XLR) machine translation (MT) is characterized by perplexing variability in reported performance, often making results across different language pairs difficult to contextualize. For researchers focused on specific language groups -- such as ancient languages -- it is nearly impossible to determine whether breakthroughs reported in other contexts (e.g., native African or American languages) result from superior methodologies or are merely artifacts of benchmark collection. To address this problem, we introduce the FRED Difficulty Metrics, comprising the Fertility Ratio (F), Retrieval Proxy (R), Pre-training Exposure (E), and Corpus Diversity (D), which serve as dataset-intrinsic metrics for contextualizing reported scores. These metrics reveal that a significant portion of result variability is explained by train-test overlap and pre-training exposure rather than by model capability. Additionally, we find that some languages -- particularly extinct and non-Latin indigenous languages -- suffer from poor tokenization coverage (high token fertility), highlighting a fundamental limitation of transferring models from high-resource languages when no vocabulary is shared. By providing these indices alongside performance scores, we enable more transparent evaluation of cross-lingual transfer and provide a more reliable foundation for the XLR MT community.
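Train-test overlap, which the abstract identifies as a primary driver of score variability, can be approximated cheaply with n-gram matching. The sketch below is one plausible stand-in for such an overlap check, not the paper's actual Retrieval Proxy (R), whose precise definition is not given here; `n=4` and whitespace tokenization are illustrative assumptions.

```python
def ngram_overlap(test_sentences, train_sentences, n=4):
    """Fraction of test-side n-grams that also occur in the training
    corpus -- a rough proxy for train-test overlap."""
    def ngrams(sent):
        toks = sent.split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    # Index all training n-grams once, then score the test set.
    train_ngrams = set()
    for sent in train_sentences:
        train_ngrams |= ngrams(sent)

    hits = total = 0
    for sent in test_sentences:
        for g in ngrams(sent):
            total += 1
            hits += g in train_ngrams
    return hits / max(total, 1)
```

A benchmark whose test set scores high on this check is "easy" for reasons unrelated to model capability, which is exactly the confound the FRED indices are meant to surface.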