🤖 AI Summary
This paper addresses the pervasive system dependency problem in machine translation (MT) automatic evaluation metrics, where metric scores are biased by idiosyncratic characteristics of specific MT systems, leading to unreliable and unfair cross-system comparisons. The authors propose the first formal, quantitative framework for measuring system dependency. Their methodology integrates rank stability analysis, cross-system score distribution comparison, Monte Carlo perturbation experiments, and statistical significance testing to enable reproducible, diagnostic assessment of metric bias. Empirical validation on WMT benchmarks reveals statistically significant system dependency in BLEU, COMET, and BERTScore. Beyond exposing latent fairness deficiencies in widely adopted metrics, the work establishes a new benchmark for fair evaluation and releases an open-source diagnostic toolkit. It provides both theoretical foundations and practical guidance for metric design, selection, and refinement, advancing rigor and equity in MT evaluation.
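To illustrate the kind of diagnostic the summary describes, the sketch below combines Monte Carlo perturbation with rank stability analysis: segments are repeatedly subsampled and the resulting system rankings are compared against the full-data ranking with Kendall's tau. This is a minimal illustration under assumed inputs (per-segment metric scores keyed by system name), not the authors' released toolkit; function names and parameters are hypothetical.

```python
# Minimal sketch (not the paper's toolkit): rank stability of a metric under
# Monte Carlo perturbation of the evaluated segment sample.
# Assumes scores[system] is an array of per-segment metric scores.
import numpy as np
from scipy.stats import kendalltau

def rank_stability(scores: dict, n_trials: int = 1000,
                   sample_frac: float = 0.8, seed: int = 0) -> float:
    """Mean Kendall's tau between the full-data system ranking and rankings
    from random segment subsamples; values near 1 indicate stable ranks."""
    rng = np.random.default_rng(seed)
    systems = sorted(scores)
    n_seg = len(next(iter(scores.values())))
    full_means = np.array([scores[s].mean() for s in systems])

    taus = []
    for _ in range(n_trials):
        # Perturb the evaluation set by drawing a random subset of segments.
        idx = rng.choice(n_seg, size=int(sample_frac * n_seg), replace=False)
        sub_means = np.array([scores[s][idx].mean() for s in systems])
        tau, _ = kendalltau(full_means, sub_means)
        taus.append(tau)
    return float(np.mean(taus))

# Synthetic per-segment scores for three hypothetical systems.
rng = np.random.default_rng(1)
toy = {f"sys{i}": rng.normal(0.6 + 0.02 * i, 0.1, size=500) for i in range(3)}
print(f"mean Kendall tau over subsamples: {rank_stability(toy):.3f}")
```

A low mean tau for one metric but not another would suggest that the first metric's system ranking is sensitive to the particular segments evaluated, which is one symptom of system dependency.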
📝 Abstract
Automated metrics for Machine Translation have made significant progress, with the goal of replacing expensive and time-consuming human evaluations. These metrics are typically assessed by their correlation with human judgments, which captures the monotonic relationship between human and metric scores. However, we argue that it is equally important to ensure that metrics treat all systems fairly and consistently. In this paper, we introduce a method to evaluate this aspect.
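For context on the standard meta-evaluation protocol the abstract contrasts with, the sketch below computes a Spearman correlation between metric and human scores, the usual way of capturing their monotonic relationship. The data is synthetic and purely illustrative; it is not taken from the paper.

```python
# Minimal sketch of correlation-based metric assessment: Spearman's rho measures
# the monotonic agreement between system-level metric scores and human judgments.
from scipy.stats import spearmanr

human_scores  = [72.1, 68.4, 80.3, 75.0, 65.2]   # hypothetical human ratings per system
metric_scores = [0.61, 0.58, 0.70, 0.66, 0.55]   # hypothetical metric scores per system

rho, p_value = spearmanr(human_scores, metric_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```

High correlation of this kind shows that a metric ranks systems roughly as humans do, but it does not by itself reveal whether the metric treats individual systems consistently, which is the aspect the paper's method targets.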