📝 Abstract
This paper investigates the reference-less evaluation of machine translation for low-resource language pairs, known as quality estimation (QE). Segment-level QE is a challenging cross-lingual language understanding task that assigns a quality score (0–100) to the translated output. We comprehensively evaluate large language models (LLMs) in zero/few-shot scenarios and perform instruction fine-tuning using a novel prompt based on annotation guidelines. Our results indicate that prompt-based approaches are outperformed by encoder-based fine-tuned QE models. Our error analysis reveals tokenization issues, along with errors due to transliteration and named entities, and argues for refinement in LLM pre-training for cross-lingual tasks. We publicly release our data and trained models for further research.
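As a rough illustration of the zero/few-shot setup described above, a segment-level QE prompt might be assembled as follows. This is a minimal hypothetical sketch, not the paper's actual guideline-based prompt; the function name and wording are assumptions for illustration only.

```python
def build_qe_prompt(source, translation, src_lang, tgt_lang, examples=()):
    """Build a hypothetical zero/few-shot prompt asking an LLM for a
    segment-level quality score (0-100) without a reference translation.

    `examples` is an optional sequence of (source, translation, score)
    tuples used as few-shot demonstrations; leave it empty for zero-shot.
    """
    lines = [
        f"Rate the quality of the {tgt_lang} translation of the {src_lang} "
        "source on a scale from 0 (unintelligible) to 100 (perfect). "
        "Reply with a single integer."
    ]
    # Few-shot demonstrations precede the segment to be scored.
    for ex_src, ex_tgt, ex_score in examples:
        lines += [f"Source: {ex_src}",
                  f"Translation: {ex_tgt}",
                  f"Score: {ex_score}"]
    # The model is expected to continue after the final "Score:".
    lines += [f"Source: {source}",
              f"Translation: {translation}",
              "Score:"]
    return "\n".join(lines)
```

In the few-shot variant, the demonstrations double as an implicit scoring rubric, which is why annotation-guideline-based prompt wording matters for calibration.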