Don't Sweat the Small Stuff: Segment-Level Meta-Evaluation Based on Pairwise Difference Correlation

📅 2025-09-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing segment-level meta-evaluation metrics for machine translation, such as Pearson's ρ and Kendall's τ, are sensitive to noise and lack robustness. To address this, we propose Pairwise Difference Pearson (PDP), a novel segment-level meta-evaluation metric that operates on pairwise score differences between segments rather than on raw human or system scores. PDP reformulates global Pearson correlation as intra-segment pairwise comparisons, aggregating information across all segments for a more robust model of the score distribution. Empirical evaluation on the WMT'24 shared task shows that PDP agrees more closely with human judgments, ranks sentinel metrics more accurately, and better fits human-annotated error weights than prior approaches. PDP is also robust under diverse perturbations, including random noise, segment-level bias, and system-level bias, though it remains sensitive to extreme outliers. The work introduces difference-driven correlation modeling into MT meta-evaluation, a step toward more robust and interpretable segment-level metric assessment.
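The core idea above can be illustrated with a minimal sketch: pool the intra-segment pairwise score differences across all segments, then compute a single Pearson correlation on the pooled differences. The function name `pdp` and the data layout (one list of per-system scores per segment) are illustrative assumptions; the paper's exact formulation (normalization, tie handling) may differ.

```python
from itertools import combinations

def pdp(human, metric):
    """Illustrative Pairwise Difference Pearson (PDP) sketch.

    human, metric: per-segment lists of per-system scores
    (same shape). Pools intra-segment pairwise differences
    across segments, then computes one Pearson correlation
    over the pooled difference pairs.
    """
    dh, dm = [], []
    for h_seg, m_seg in zip(human, metric):
        # All within-segment system pairs (i, j), i < j.
        for i, j in combinations(range(len(h_seg)), 2):
            dh.append(h_seg[i] - h_seg[j])
            dm.append(m_seg[i] - m_seg[j])
    n = len(dh)
    mean_h = sum(dh) / n
    mean_m = sum(dm) / n
    cov = sum((a - mean_h) * (b - mean_m) for a, b in zip(dh, dm))
    var_h = sum((a - mean_h) ** 2 for a in dh)
    var_m = sum((b - mean_m) ** 2 for b in dm)
    return cov / (var_h * var_m) ** 0.5
```

Because each difference pair stays within one segment, a constant per-segment offset in the metric's scores cancels out, which is the intuition behind PDP's robustness to segment-level bias.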

📝 Abstract
This paper introduces Pairwise Difference Pearson (PDP), a novel segment-level meta-evaluation metric for Machine Translation (MT) that addresses limitations in previous Pearson's $ρ$-based and Kendall's $τ$-based meta-evaluation approaches. PDP is a correlation-based metric that utilizes pairwise score differences rather than raw scores. It draws on information from all segments for a more robust understanding of score distributions and uses segment-wise pairwise differences to refine global Pearson correlation into intra-segment score comparisons. Analysis on the WMT'24 shared task shows that PDP properly ranks sentinel evaluation metrics and aligns better with human error weightings than previous work. Noise-injection analysis demonstrates PDP's robustness to random noise, segment bias, and system bias, while highlighting its sensitivity to extreme outliers.
Problem

Research questions and friction points this paper is trying to address.

Addressing limitations in segment-level meta-evaluation metrics for Machine Translation
Proposing correlation-based metric using pairwise differences instead of raw scores
Improving robustness to noise and alignment with human error weightings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pairwise Difference Pearson metric for MT evaluation
Uses segment-wise pairwise differences instead of raw scores
Robust to noise and aligns with human error weightings