When LLMs Struggle: Reference-less Translation Evaluation for Low-resource Languages

📅 2025-01-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses reference-free machine translation quality estimation (QE) for low-resource language pairs, targeting segment-level quality scores on a 0–100 scale. To overcome the absence of human reference translations, we systematically compare three paradigms: zero- and few-shot prompting of large language models (LLMs), instruction fine-tuning with a structured, guideline-based prompt, and fine-tuning of encoder-based architectures. Experimental results show that encoder fine-tuning significantly outperforms all LLM prompting variants. Tokenization, transliteration, and named entity handling emerge as critical bottlenecks for LLMs in QE. Error analysis and cross-lingual pretraining diagnostics further expose limitations in cross-lingual semantic understanding. We release the first benchmark dataset for low-resource QE alongside the corresponding fine-tuned models, providing an empirical foundation for trustworthy QE in resource-scarce settings.

📝 Abstract
This paper investigates the reference-less evaluation of machine translation for low-resource language pairs, known as quality estimation (QE). Segment-level QE is a challenging cross-lingual language understanding task that assigns a quality score (0–100) to the translated output. We comprehensively evaluate large language models (LLMs) in zero- and few-shot scenarios and perform instruction fine-tuning using a novel prompt based on annotation guidelines. Our results indicate that prompt-based approaches are outperformed by encoder-based fine-tuned QE models. Our error analysis reveals tokenization issues, along with errors due to transliteration and named entities, and argues for refined LLM pre-training for cross-lingual tasks. We publicly release the data and trained models for further research.
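The zero-shot prompting paradigm the abstract describes can be sketched as below. This is an illustrative assumption, not the paper's actual prompt or code: the instruction wording, function names, and score parsing are hypothetical, and the call to an LLM backend is left as a stub.

```python
import re


def build_qe_prompt(source: str, translation: str) -> str:
    # Hypothetical guideline-style instruction asking for a 0-100 segment score,
    # with no reference translation involved (reference-free QE).
    return (
        "You are a translation quality estimator.\n"
        "Rate the translation of the source segment on a 0-100 scale,\n"
        "where 0 means completely unrelated and 100 means a perfect translation.\n"
        f"Source: {source}\n"
        f"Translation: {translation}\n"
        "Answer with a single integer score."
    )


def parse_score(llm_output: str) -> float:
    # Extract the first number from the model's reply and clamp it to [0, 100].
    match = re.search(r"\d+(?:\.\d+)?", llm_output)
    if match is None:
        raise ValueError("no numeric score in model output")
    return min(100.0, max(0.0, float(match.group())))
```

In practice the prompt string would be sent to an LLM API and the reply passed through `parse_score`; the encoder-based alternative the paper favors would instead regress the score directly from the source–translation pair.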
Problem

Research questions and friction points this paper is trying to address.

Language Models
Translation Quality Estimation
Resource-scarce Language Pairs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Resource-scarce Language Pairs
Large Language Model Training
Translation Quality Evaluation