A Measure of the System Dependence of Automated Metrics

📅 2024-12-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the system dependency problem in automatic machine translation (MT) evaluation metrics: metric scores can be biased by idiosyncratic characteristics of specific MT systems, making cross-system comparisons unreliable and unfair. The authors propose the first formal, quantitative framework for measuring system dependency. Their methodology combines rank stability analysis, cross-system score distribution comparison, Monte Carlo perturbation experiments, and statistical significance testing to enable a reproducible, diagnostic assessment of metric bias. Empirical validation on WMT benchmarks reveals statistically significant system dependency in BLEU, COMET, and BERTScore. Beyond exposing fairness deficiencies in widely adopted metrics, the work establishes a benchmark for fair evaluation and releases an open-source diagnostic toolkit, providing both theoretical foundations and practical guidance for metric design, selection, and refinement.
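The core idea of measuring system dependence can be illustrated with a small sketch. This is not the paper's actual procedure, just a hypothetical analogue of its Monte Carlo perturbation and significance-testing components: if a metric is system-independent, the per-segment residuals (metric score minus human score) should be exchangeable across systems, so the between-system variance of mean residuals under random reshuffling gives a null distribution and a permutation p-value. The function name, data layout, and permutation count are all assumptions for illustration.

```python
import random
import statistics

def system_dependence(metric_scores, human_scores, n_perm=2000, seed=0):
    """Hypothetical permutation test for system dependence of a metric.

    metric_scores / human_scores: dicts mapping system name -> list of
    segment-level scores. Returns (observed between-system variance of
    mean residuals, Monte Carlo p-value). Large variance with a small
    p-value suggests the metric treats systems inconsistently.
    """
    # Per-segment residuals, labelled by the system that produced them.
    labels, values = [], []
    for sys_name in metric_scores:
        for m, h in zip(metric_scores[sys_name], human_scores[sys_name]):
            labels.append(sys_name)
            values.append(m - h)
    systems = list(metric_scores)

    def between_var(vals):
        # Variance of per-system mean residuals for a given assignment.
        means = [
            statistics.fmean(v for lab, v in zip(labels, vals) if lab == s)
            for s in systems
        ]
        return statistics.pvariance(means)

    observed = between_var(values)

    # Monte Carlo null: shuffle residuals across system labels.
    rng = random.Random(seed)
    shuffled = list(values)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if between_var(shuffled) >= observed:
            exceed += 1
    p_value = (exceed + 1) / (n_perm + 1)
    return observed, p_value

# Toy example: system "A" receives a constant +0.5 metric inflation,
# system "B" is scored exactly as humans do.
metric = {"A": [0.8, 0.9, 0.85, 0.95, 0.9], "B": [0.5, 0.55, 0.6, 0.5, 0.52]}
human = {"A": [0.3, 0.4, 0.35, 0.45, 0.4], "B": [0.5, 0.55, 0.6, 0.5, 0.52]}
obs, p = system_dependence(metric, human)
print(f"between-system variance = {obs:.4f}, p = {p:.4f}")
```

The permutation null avoids distributional assumptions, which matters here because segment-level metric residuals are rarely Gaussian.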

📝 Abstract
Automated metrics for Machine Translation have made significant progress, with the goal of replacing expensive and time-consuming human evaluations. These metrics are typically assessed by their correlation with human judgments, which captures the monotonic relationship between human and metric scores. However, we argue that it is equally important to ensure that metrics treat all systems fairly and consistently. In this paper, we introduce a method to evaluate this aspect.
Problem

Research questions and friction points this paper is trying to address.

Machine Translation
Fairness Evaluation
Automated Scoring
Innovation

Methods, ideas, or system contributions that make the work stand out.

Machine Translation Evaluation
Fairness Assessment
Automated Scoring Methods