Long-context Reference-based MT Quality Estimation

📅 2025-09-17
🤖 AI Summary
Problem: Short-context modeling in Machine Translation Quality Estimation (QE) yields insufficient correlation with human judgments. Method: We propose a multilingual regression framework that incorporates long-context information by (i) constructing long-context training data in which each example comprises a source, a translation, and a reference; (ii) unifying the MQM, SQM, and DA annotation schemes by normalising their scales and assigning each concatenated example a weighted average of its sentence-level scores; and (iii) extending the COMET architecture with sentence concatenation and score normalization to predict Error Span Annotation (ESA) scores at the segment level. Results: The approach achieves statistically significant improvements in correlation with human ratings across multiple benchmarks (e.g., an average +0.12 gain in Pearson correlation coefficient), outperforming short-context baselines. This suggests that explicit long-context modeling improves QE accuracy.

📝 Abstract
In this paper, we present our submission to the Tenth Conference on Machine Translation (WMT25) Shared Task on Automated Translation Quality Evaluation. Our systems are built upon the COMET framework and trained to predict segment-level Error Span Annotation (ESA) scores using augmented long-context data. To construct long-context training data, we concatenate in-domain, human-annotated sentences and compute a weighted average of their scores. We integrate multiple human judgment datasets (MQM, SQM, and DA) by normalising their scales and train multilingual regression models to predict quality scores from the source, hypothesis, and reference translations. Experimental results show that incorporating long-context information improves correlations with human judgments compared to models trained only on short segments.
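
The long-context construction described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Segment` class, the function names, the window size, and in particular the choice to weight each score by hypothesis length in whitespace tokens are all assumptions, since the abstract only states that a weighted average is used.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    src: str      # source sentence
    mt: str       # machine translation (hypothesis)
    ref: str      # reference translation
    score: float  # human score, assumed already on a common scale


def concat_segments(segments: List[Segment], sep: str = " ") -> Segment:
    """Join consecutive segments into one long-context example.

    Assumption: each score is weighted by the hypothesis length in
    whitespace tokens. The source does not specify the weighting.
    """
    if not segments:
        raise ValueError("need at least one segment")
    weights = [max(len(s.mt.split()), 1) for s in segments]
    total = sum(weights)
    score = sum(w * s.score for w, s in zip(weights, segments)) / total
    return Segment(
        src=sep.join(s.src for s in segments),
        mt=sep.join(s.mt for s in segments),
        ref=sep.join(s.ref for s in segments),
        score=score,
    )


def build_long_context(doc: List[Segment], window: int = 3) -> List[Segment]:
    """Group a document's segments into non-overlapping windows."""
    return [concat_segments(doc[i:i + window]) for i in range(0, len(doc), window)]
```

Under this weighting, longer sentences contribute more to the label of the concatenated example; a uniform average would be the simplest alternative.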

Problem

Research questions and friction points this paper is trying to address.

Predicting translation quality scores using long-context data
Integrating multiple human judgment datasets via normalization
Improving correlation with human judgments through context augmentation
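
One way to bring MQM, SQM, and DA scores onto a common scale is a linear rescaling to [0, 1], sketched below. The bounds in `SCALE_BOUNDS` are placeholder values chosen for illustration, and min-max rescaling itself is an assumption: the paper states only that the scales are normalised.

```python
from typing import Dict, Tuple

# Hypothetical (worst, best) bounds per annotation scheme.
# These are illustrative placeholders, not values from the paper.
SCALE_BOUNDS: Dict[str, Tuple[float, float]] = {
    "DA": (0.0, 100.0),
    "SQM": (0.0, 100.0),
    "MQM": (-25.0, 0.0),  # assumes penalties stored as negative numbers
}


def normalise(score: float, scheme: str) -> float:
    """Map a raw score to [0, 1], where 1 is the best quality."""
    worst, best = SCALE_BOUNDS[scheme]
    scaled = (score - worst) / (best - worst)
    # Clip scores that fall outside the assumed bounds.
    return min(1.0, max(0.0, scaled))
```

For example, with these placeholder bounds a DA score of 50 and an MQM score of -12.5 both map to 0.5, so the two datasets can be mixed in one regression objective.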

Innovation

Methods, ideas, or system contributions that make the work stand out.

Long-context augmented training data
Multilingual regression models trained on integrated MQM, SQM, and DA data
Error Span Annotation score prediction

Sami Ul Haq
ADAPT Centre, Dublin City University, Dublin, Ireland
Chinonso Cynthia Osuji
ADAPT Centre, Dublin City University, Dublin, Ireland
Sheila Castilho
SALIS/ADAPT Centre - Dublin City University
machine translation, MT evaluation, NLP
Brian Davis
ADAPT Centre, Dublin City University, Dublin, Ireland