Enhancing Human Evaluation in Machine Translation with Comparative Judgment

📅 2025-02-25
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This study addresses low accuracy and inter-annotator inconsistency in human evaluation of machine translation (MT), stemming from variation in annotator capability and bias in task design. We propose a shift from traditional point-wise annotation to pairwise side-by-side (SxS) comparison, grounded in the Multidimensional Quality Metrics (MQM) framework. Through a systematic comparison of MQM, SxS MQM, and SxS relative ranking (RR), we empirically demonstrate that the SxS settings significantly improve inter-annotator agreement and cross-system error-detection stability, with error-marking consistency gains averaging 38.5% for explicitly compared MT systems and 19.5% for others, especially for subtle semantic and stylistic deviations often missed by MQM. All settings preserve system-level ranking stability, while SxS RR achieves the best trade-off between evaluation efficiency and reliability. We publicly release a triply annotated dataset comprising 377 Chinese–English and 104 English–German annotation examples, establishing a new benchmark and reproducible resource for MT evaluation.

📝 Abstract
Human evaluation is crucial for assessing rapidly evolving language models but is influenced by annotator proficiency and task design. This study explores the integration of comparative judgment into human annotation for machine translation (MT) and evaluates three annotation setups: point-wise Multidimensional Quality Metrics (MQM), side-by-side (SxS) MQM, and its simplified version, SxS relative ranking (RR). In MQM, annotators mark error spans with categories and severity levels. SxS MQM extends MQM to pairwise error annotation for two translations of the same input, while SxS RR focuses on selecting the better output without labeling errors. Key findings are: (1) the SxS settings achieve higher inter-annotator agreement than MQM; (2) SxS MQM enhances inter-translation error marking consistency compared to MQM by, on average, 38.5% for explicitly compared MT systems and 19.5% for others; (3) all annotation settings return stable system rankings, with SxS RR offering a more efficient alternative to (SxS) MQM; (4) the SxS settings highlight subtle errors overlooked in MQM without altering absolute system evaluations. To spur further research, we will release the triply annotated datasets comprising 377 ZhEn and 104 EnDe annotation examples.
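
The three annotation setups differ mainly in what an annotator records per item. The sketch below is a rough illustration of those differences, not the authors' released data schema; all field names are hypothetical.

```python
# Hypothetical data structures contrasting the three annotation setups:
# point-wise MQM, pairwise SxS MQM, and span-free SxS relative ranking (RR).
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ErrorSpan:
    start: int        # character offset into the translation
    end: int
    category: str     # e.g. "accuracy/mistranslation", "fluency/grammar"
    severity: str     # e.g. "minor", "major", "critical"


@dataclass
class MQMAnnotation:
    source: str
    translation: str                    # a single system output (point-wise)
    errors: List[ErrorSpan] = field(default_factory=list)


@dataclass
class SxSMQMAnnotation:
    source: str
    translation_a: str                  # two outputs for the same input,
    translation_b: str                  # annotated jointly side by side
    errors_a: List[ErrorSpan] = field(default_factory=list)
    errors_b: List[ErrorSpan] = field(default_factory=list)


@dataclass
class SxSRRAnnotation:
    source: str
    translation_a: str
    translation_b: str
    preference: Optional[str] = None    # "a", "b", or "tie"; no error spans marked
```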
Problem

Research questions and friction points this paper is trying to address.

Improving human evaluation in machine translation.
Comparing annotation methods for translation quality.
Enhancing inter-annotator agreement and error consistency (see the sketch after this list).
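
To make the agreement notion concrete, here is a minimal sketch of a simple pairwise agreement rate over preference labels collected in an SxS RR setup. It assumes three hypothetical annotators labeling each item as "a", "b", or "tie"; the paper's actual agreement statistics may be computed differently.

```python
# Pairwise agreement rate over SxS RR preference labels (illustrative only).
from itertools import combinations
from typing import Dict, List


def pairwise_agreement(labels_by_annotator: Dict[str, List[str]]) -> float:
    """Fraction of (annotator pair, item) combinations with identical labels."""
    annotators = list(labels_by_annotator)
    n_items = len(labels_by_annotator[annotators[0]])
    matches, total = 0, 0
    for x, y in combinations(annotators, 2):
        for i in range(n_items):
            total += 1
            matches += labels_by_annotator[x][i] == labels_by_annotator[y][i]
    return matches / total if total else 0.0


# Example with made-up labels: higher values mean more consistent judgments.
print(pairwise_agreement({
    "ann1": ["a", "b",   "tie", "a"],
    "ann2": ["a", "b",   "a",   "a"],
    "ann3": ["a", "tie", "a",   "a"],
}))  # -> 0.666...
```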
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comparative judgment in MT evaluation
Side-by-side Multidimensional Quality Metrics
Simplified relative ranking (SxS RR) for annotation efficiency (see the sketch below).
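
As an illustration of how span-free SxS RR judgments can still yield system-level rankings, the sketch below aggregates pairwise preferences into a ranking via win rates. This is only one plausible aggregation, not necessarily the one used in the paper; the system names and judgments are made up.

```python
# Aggregating SxS RR preferences into a system ranking by win rate (illustrative).
from collections import defaultdict
from typing import Dict, List, Tuple


def rank_by_win_rate(judgments: List[Tuple[str, str, str]]) -> List[Tuple[str, float]]:
    """judgments: (system_a, system_b, preference), preference in {"a", "b", "tie"}."""
    wins: Dict[str, float] = defaultdict(float)
    comparisons: Dict[str, int] = defaultdict(int)
    for sys_a, sys_b, pref in judgments:
        comparisons[sys_a] += 1
        comparisons[sys_b] += 1
        if pref == "a":
            wins[sys_a] += 1
        elif pref == "b":
            wins[sys_b] += 1
        else:                      # a tie counts as half a win for each system
            wins[sys_a] += 0.5
            wins[sys_b] += 0.5
    rates = {s: wins[s] / comparisons[s] for s in comparisons}
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)


# Example with hypothetical systems and judgments.
print(rank_by_win_rate([
    ("sysX", "sysY", "a"),
    ("sysX", "sysZ", "tie"),
    ("sysY", "sysZ", "b"),
]))  # -> [("sysX", 0.75), ("sysZ", 0.75), ("sysY", 0.0)]
```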