🤖 AI Summary
This study addresses the absence of publicly available benchmark datasets for English–Hebrew machine translation quality estimation (MTQE) by constructing and releasing MTQE.en-he, the first such resource, comprising 959 English segments paired with their Hebrew machine translations and human-assigned quality scores. Building on this dataset, the work systematically evaluates a range of quality estimation approaches and introduces a multi-model ensemble strategy that outperforms each individual model. It also validates parameter-efficient fine-tuning techniques, including LoRA and BitFit, for this low-resource language pair. Experimental results show that the proposed ensemble improves Pearson and Spearman correlation with human judgments by 6.4 and 5.6 percentage points, respectively, over the best single model (CometKiwi), while parameter-efficient fine-tuning methods consistently yield gains of 2-3 percentage points.
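The ensemble and evaluation described above can be sketched in a few lines. This is a minimal illustration, not the paper's exact method: it assumes the three systems' per-segment scores are combined by uniform averaging, and all scores below are toy values, not data from MTQE.en-he. Pearson and Spearman correlations (Spearman being Pearson over ranks) are implemented directly so the example is self-contained.

```python
# Hedged sketch: score-level ensembling of three QE systems by simple
# averaging, evaluated with Pearson and Spearman correlation against
# human Direct Assessment (DA) scores. All numbers are illustrative.

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def ranks(xs):
    """1-based ranks, averaging ranks over ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical per-segment quality scores from the three systems.
chatgpt    = [0.70, 0.40, 0.90, 0.30, 0.60]
transquest = [0.65, 0.50, 0.85, 0.35, 0.55]
cometkiwi  = [0.75, 0.45, 0.95, 0.25, 0.65]
human_da   = [72, 45, 90, 30, 58]  # human DA scores (toy values)

# Uniform-average ensemble of the three model scores per segment.
ensemble = [(a + b + c) / 3 for a, b, c in zip(chatgpt, transquest, cometkiwi)]

print(round(pearson(ensemble, human_da), 3))
print(round(spearman(ensemble, human_da), 3))
```

In practice one would compute these correlations with `scipy.stats.pearsonr` and `spearmanr`; the hand-rolled versions here only keep the sketch dependency-free.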
📝 Abstract
We release MTQE.en-he: to our knowledge, the first publicly available English-Hebrew benchmark for Machine Translation Quality Estimation. MTQE.en-he contains 959 English segments from WMT24++, each paired with a machine translation into Hebrew and Direct Assessment scores of translation quality annotated by three human experts. We benchmark ChatGPT prompting, TransQuest, and CometKiwi and show that ensembling the three models outperforms the best single model (CometKiwi) by 6.4 percentage points in Pearson and 5.6 percentage points in Spearman correlation. Fine-tuning experiments with TransQuest and CometKiwi reveal that full-model updates are sensitive to overfitting and distribution collapse, yet parameter-efficient methods (LoRA, BitFit, and FTHead, i.e., fine-tuning only the classification head) train stably and yield improvements of 2-3 percentage points. MTQE.en-he and our experimental results enable future research on this under-resourced language pair.
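The core difference between the stable fine-tuning strategies is simply which parameters they leave trainable. The sketch below illustrates that selection logic for BitFit (bias terms only) and FTHead (classification head only); parameter names and the `regression_head` prefix are hypothetical, not taken from a real CometKiwi or TransQuest checkpoint, and LoRA is omitted since it adds low-rank adapter weights rather than selecting existing ones.

```python
# Hedged sketch of parameter selection for parameter-efficient fine-tuning.
# "bitfit" updates only bias terms; "fthead" updates only the head;
# "full" updates everything (the setting prone to overfitting/collapse).

def trainable_params(names, strategy, head_prefix="regression_head"):
    """Return the parameter names a given strategy leaves trainable.

    head_prefix is an assumed name for the model's regression head.
    """
    if strategy == "full":
        return list(names)
    if strategy == "bitfit":
        return [n for n in names if n.endswith(".bias")]
    if strategy == "fthead":
        return [n for n in names if n.startswith(head_prefix)]
    raise ValueError(f"unknown strategy: {strategy}")

# Hypothetical parameter names of a small encoder + regression head.
params = [
    "encoder.layer.0.attention.weight",
    "encoder.layer.0.attention.bias",
    "encoder.layer.0.ffn.weight",
    "encoder.layer.0.ffn.bias",
    "regression_head.weight",
    "regression_head.bias",
]

print(trainable_params(params, "bitfit"))
print(trainable_params(params, "fthead"))
```

With a real PyTorch model, the same idea is applied by iterating `model.named_parameters()` and setting `requires_grad` only on the selected names, so the optimizer touches a small fraction of the weights.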