🤖 AI Summary
The reliability of automatic metrics, which are commonly benchmarked against human judgments, is increasingly questioned, yet no systematic evaluation has treated human annotators themselves as a meta-evaluation baseline. Method: This work incorporates human baselines into MT meta-evaluation, scoring human annotators alongside automatic metrics by measuring their agreement with gold human judgments. Contribution/Results: Experiments show that several state-of-the-art automatic metrics reach agreement levels on par with, or even above, those of human annotators on specific machine translation tasks, challenging the long-standing assumption that human judgments constitute an inherent upper bound. The study identifies measurement bottlenecks and reliability limits in current MT evaluation paradigms, providing an empirical reference point and a methodology for validating the trustworthiness of automatic evaluation metrics.
📝 Abstract
In Machine Translation (MT) evaluation, metric performance is assessed based on agreement with human judgments. In recent years, automatic metrics have demonstrated increasingly high levels of agreement with humans. To gain a clearer understanding of metric performance and establish an upper bound, we incorporate human baselines in the MT meta-evaluation, that is, the assessment of MT metrics' capabilities. Our results show that human annotators are not consistently superior to automatic metrics, with state-of-the-art metrics often ranking on par with or higher than human baselines. Despite these findings suggesting human parity, we discuss several reasons for caution. Finally, we explore the broader implications of our results for the research field, asking: Can we still reliably measure improvements in MT evaluation? With this work, we aim to shed light on the limits of our ability to measure progress in the field, fostering discussion on an issue that we believe is crucial to the entire MT evaluation community.
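To make the meta-evaluation setup concrete, below is a minimal sketch of the core idea: a human annotator can be scored exactly like an automatic metric by computing its agreement with gold human judgments. This is not the paper's exact protocol; the data, the segment-level Kendall's tau agreement measure, and all variable names here are illustrative assumptions.

```python
# Illustrative sketch: scoring a human annotator as if it were an MT metric.
# Hypothetical data; the paper's actual protocol and agreement measure may differ.
from scipy.stats import kendalltau

# Gold-standard human judgments for a set of translation segments (e.g., 0-100 quality scores).
gold_judgments = [78, 62, 91, 45, 70, 83, 55, 88]

# Scores from an automatic metric on the same segments.
metric_scores = [0.74, 0.58, 0.95, 0.40, 0.66, 0.80, 0.61, 0.85]

# Scores from a held-out human annotator, treated as a "human baseline".
human_baseline = [80, 60, 85, 50, 72, 79, 52, 90]

def segment_level_agreement(scores, gold):
    """Kendall's tau between candidate scores and the gold human judgments."""
    tau, _ = kendalltau(scores, gold)
    return tau

print("metric vs. gold:        ", segment_level_agreement(metric_scores, gold_judgments))
print("human baseline vs. gold:", segment_level_agreement(human_baseline, gold_judgments))
# If the metric's agreement matches or exceeds the human baseline's, the metric
# ranks "on par with or higher than" the human baseline in this meta-evaluation.
```

Under this framing, the human baseline's agreement score serves as the reference point against which metric improvements are judged, which is why parity between metrics and annotators raises the question of whether further gains can still be measured reliably.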