Who Watches the Watchmen? Humans Disagree With Translation Metrics on Unseen Domains

📅 2026-04-19

🤖 AI Summary

This study examines whether the robustness of automatic machine translation evaluation metrics on unseen domains is overestimated, and why prior analyses could not separate domain-shift effects from human annotation noise. Using a controlled design that fixes the annotators within each language pair and the six translation systems being evaluated, the authors compare one seen domain (news) with two unseen technical domains and introduce CD-ESA, a multi-annotator cross-domain error span annotation dataset with 18.8k annotations across three language pairs. Automatic metrics at first appear robust to domain shift at the segment level (up to 0.69 agreement with human judgments), but this robustness largely disappears once human label variation is taken into account. In the unseen chemical domain, metric-human agreement (0.78–0.83) falls well short of inter-annotator agreement (0.96). The authors therefore recommend judging metric-human agreement against inter-annotator agreement, rather than in isolation, when evaluating across domains.

📝 Abstract

Automatic evaluation metrics are central to the development of machine translation systems, yet their robustness under domain shift remains unclear. Most metrics are developed on the Workshop on Machine Translation (WMT) benchmarks, raising concerns about their robustness to unseen domains. Prior studies that analyze unseen domains vary translation systems, annotators, or evaluation conditions, confounding domain effects with human annotation noise. To address these biases, we introduce a systematic multi-annotator Cross-Domain Error-Span-Annotation dataset (CD-ESA), comprising 18.8k human error span annotations across three language pairs, where we fix annotators within each language pair and evaluate translations of the same six translation systems across one seen news domain and two unseen technical domains. Using this dataset, we first find that automatic metrics appear surprisingly robust to domain shifts at the segment level (up to 0.69 agreement), but this robustness largely disappears once we account for human label variation. Averaging annotations increases inter-annotator agreement by up to +0.11. Metrics struggle on the unseen chemical domain compared to humans (inter-annotator agreement of 0.78-0.83 vs. 0.96). We recommend comparing metric-human agreement against inter-annotator agreement, rather than comparing raw metric-human agreement alone, when evaluating across different domains.
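The recommendation in the abstract can be illustrated with a minimal sketch. It makes several assumptions that do not come from the paper: agreement is measured with Pearson correlation (the abstract does not name the agreement statistic), each annotator's error spans have already been reduced to one numeric score per segment, and all function names and toy numbers are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length score lists (assumed agreement measure)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def inter_annotator_agreement(annotations):
    """Mean pairwise agreement between annotators (one score list per annotator)."""
    pairs = list(combinations(annotations, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


def metric_human_agreement(metric_scores, annotations):
    """Mean agreement between the metric and each individual annotator."""
    return sum(pearson(metric_scores, a) for a in annotations) / len(annotations)


def averaged_annotation(annotations):
    """Per-segment mean over annotators, which smooths individual label noise."""
    return [sum(scores) / len(scores) for scores in zip(*annotations)]


def agreement_gap(metric_scores, annotations):
    """How far the metric falls short of the human-human ceiling (hypothetical helper)."""
    return inter_annotator_agreement(annotations) - metric_human_agreement(
        metric_scores, annotations
    )


if __name__ == "__main__":
    # Toy per-segment scores for one domain; made-up numbers, not CD-ESA data.
    annotators = [
        [90, 70, 40, 85, 60],
        [88, 75, 35, 80, 65],
        [92, 65, 45, 90, 55],
    ]
    metric = [0.81, 0.77, 0.52, 0.70, 0.69]

    print("inter-annotator agreement:", inter_annotator_agreement(annotators))
    print("metric-human agreement:   ", metric_human_agreement(metric, annotators))
    print("gap to human ceiling:     ", agreement_gap(metric, annotators))
    print("metric vs. averaged labels:", pearson(metric, averaged_annotation(annotators)))
```

Reporting the two numbers side by side per domain, rather than the metric-human figure alone, is what makes a domain comparison interpretable: a metric-human agreement that looks acceptable in isolation can still sit well below what annotators achieve with each other on the same segments.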

Problem

Research questions and friction points this paper is trying to address.

machine translation

evaluation metrics

domain shift

human annotation

robustness

Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-domain evaluation

error span annotation

machine translation metrics

inter-annotator agreement

domain shift

Apple

Seattle, United States of America
