🤖 AI Summary
This study addresses the overestimated robustness of existing automatic machine translation evaluation metrics in unseen domains and the inability of prior analyses to disentangle domain transfer effects from human annotation noise. Using a controlled design that fixes the translation systems and the annotators within each language pair, the authors evaluate six translation systems across one seen domain (news) and two unseen technical domains, introducing CD-ESA, a multi-annotator cross-domain error span annotation dataset. Experimental results show that automatic metrics appear robust at the segment level (up to 0.69 agreement with human judgments), but this apparent robustness largely disappears once annotator variability is taken into account. Notably, in the chemistry domain, metric-human agreement (0.78–0.83) falls well short of inter-annotator agreement (0.96), exposing, under controlled conditions, the limits of current automatic metrics under domain shift.
📝 Abstract
Automatic evaluation metrics are central to the development of machine translation systems, yet their robustness under domain shift remains unclear. Most metrics are developed on the Workshop on Machine Translation (WMT) benchmarks, raising concerns about their robustness to unseen domains. Prior studies that analyze unseen domains vary translation systems, annotators, or evaluation conditions, confounding domain effects with human annotation noise.
To address these confounds, we introduce a systematic multi-annotator Cross-Domain Error-Span-Annotation dataset (CD-ESA), comprising 18.8k human error span annotations across three language pairs. We fix the annotators within each language pair and evaluate translations from the same six translation systems across one seen news domain and two unseen technical domains. Using this dataset, we first find that automatic metrics appear surprisingly robust to domain shift at the segment level (up to 0.69 agreement), but this robustness largely disappears once we account for human label variation. Averaging annotations increases inter-annotator agreement by up to +0.11. Compared to humans, metrics struggle on the unseen chemical domain (metric-human agreement of 0.78-0.83 vs. inter-annotator agreement of 0.96).
We recommend comparing metric-human agreement against inter-annotator agreement, rather than comparing raw metric-human agreement alone, when evaluating across different domains.
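As a minimal sketch of this recommendation, the snippet below reports metric-human agreement alongside inter-annotator agreement for one domain. The abstract does not name the agreement statistic or the aggregation scheme, so Pearson correlation, mean pairwise aggregation, the function names, and the toy scores are all illustrative assumptions rather than the authors' procedure.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def agreement_report(metric, annotators):
    """Compare metric-human agreement with inter-annotator agreement.

    metric:     per-segment scores from an automatic metric
    annotators: one per-segment score list per human annotator (>= 2)
    Assumption: agreement is the mean Pearson correlation over pairs.
    """
    k = len(annotators)
    if k < 2:
        raise ValueError("need at least two annotators")
    # Inter-annotator agreement: average over all annotator pairs.
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    inter = sum(pearson(annotators[i], annotators[j]) for i, j in pairs) / len(pairs)
    # Metric-human agreement: average of the metric against each annotator.
    metric_human = sum(pearson(metric, a) for a in annotators) / k
    return {
        "metric_human": metric_human,
        "inter_annotator": inter,
        "gap": inter - metric_human,  # positive: metric trails human agreement
    }


# Hypothetical toy scores for four segments in a single domain.
metric_scores = [0.9, 0.4, 0.7, 0.2]
annotator_scores = [
    [85, 40, 60, 30],
    [90, 35, 70, 25],
    [80, 45, 65, 20],
]
print(agreement_report(metric_scores, annotator_scores))
```

Under these assumptions, the `gap` value is computed per domain and compared across domains: a metric whose raw agreement looks stable can still trail human agreement by a wider margin in an unseen domain.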