🤖 AI Summary
This paper addresses the pervasive system dependency problem in machine translation (MT) automatic evaluation metrics, where metric scores are biased by idiosyncratic characteristics of specific MT systems, leading to unreliable and unfair cross-system comparisons. The authors propose the first formal, quantitative framework for measuring system dependency. Their methodology integrates rank stability analysis, cross-system score distribution comparison, Monte Carlo perturbation experiments, and statistical significance testing to enable reproducible, diagnostic assessment of metric bias. Empirical validation on WMT benchmarks reveals statistically significant system dependency in BLEU, COMET, and BERTScore. Beyond exposing latent fairness deficiencies in widely adopted metrics, the work establishes a new benchmark for fair evaluation and releases an open-source diagnostic toolkit. It provides both theoretical foundations and practical guidance for metric design, selection, and refinement, advancing rigor and equity in MT evaluation.
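To illustrate the kind of diagnostic the summary describes, the sketch below combines Monte Carlo perturbation with rank stability analysis: segments are repeatedly subsampled and the resulting system rankings are compared against the full-data ranking with Kendall's tau. This is a minimal illustration under assumed inputs (per-segment metric scores keyed by system name), not the authors' released toolkit; function names and parameters are hypothetical.

```python
# Minimal sketch (not the paper's toolkit): rank stability of a metric under
# Monte Carlo perturbation of the evaluated segment sample.
# Assumes scores[system] is an array of per-segment metric scores.
import numpy as np
from scipy.stats import kendalltau

def rank_stability(scores: dict, n_trials: int = 1000,
                   sample_frac: float = 0.8, seed: int = 0) -> float:
    """Mean Kendall's tau between the full-data system ranking and rankings
    from random segment subsamples; values near 1 indicate stable ranks."""
    rng = np.random.default_rng(seed)
    systems = sorted(scores)
    n_seg = len(next(iter(scores.values())))
    full_means = np.array([scores[s].mean() for s in systems])

    taus = []
    for _ in range(n_trials):
        # Perturb the evaluation set by drawing a random subset of segments.
        idx = rng.choice(n_seg, size=int(sample_frac * n_seg), replace=False)
        sub_means = np.array([scores[s][idx].mean() for s in systems])
        tau, _ = kendalltau(full_means, sub_means)
        taus.append(tau)
    return float(np.mean(taus))

# Synthetic per-segment scores for three hypothetical systems.
rng = np.random.default_rng(1)
toy = {f"sys{i}": rng.normal(0.6 + 0.02 * i, 0.1, size=500) for i in range(3)}
print(f"mean Kendall tau over subsamples: {rank_stability(toy):.3f}")
```

A low mean tau for one metric but not another would suggest that the first metric's system ranking is sensitive to the particular segments evaluated, which is one symptom of system dependency.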
📝 Abstract
Automated metrics for Machine Translation have made significant progress, with the goal of replacing expensive and time-consuming human evaluations. These metrics are typically assessed by their correlation with human judgments, which captures the monotonic relationship between human and metric scores. However, we argue that it is equally important to ensure that metrics treat all systems fairly and consistently. In this paper, we introduce a method to evaluate this aspect.
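For context on the standard meta-evaluation protocol the abstract contrasts with, the sketch below computes a Spearman correlation between metric and human scores, the usual way of capturing their monotonic relationship. The data is synthetic and purely illustrative; it is not taken from the paper.

```python
# Minimal sketch of correlation-based metric assessment: Spearman's rho measures
# the monotonic agreement between system-level metric scores and human judgments.
from scipy.stats import spearmanr

human_scores  = [72.1, 68.4, 80.3, 75.0, 65.2]   # hypothetical human ratings per system
metric_scores = [0.61, 0.58, 0.70, 0.66, 0.55]   # hypothetical metric scores per system

rho, p_value = spearmanr(human_scores, metric_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```

High correlation of this kind shows that a metric ranks systems roughly as humans do, but it does not by itself reveal whether the metric treats individual systems consistently, which is the aspect the paper's method targets.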