Don't Sweat the Small Stuff: Segment-Level Meta-Evaluation Based on Pairwise Difference Correlation

📅 2025-09-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing segment-level meta-evaluation metrics for machine translation, such as Pearson's ρ and Kendall's τ, are sensitive to noise and lack robustness. To address this, we propose Pairwise Difference Pearson (PDP), a novel segment-level meta-evaluation metric that operates on pairwise score differences between segments rather than on raw human or system scores. PDP reformulates global Pearson correlation as intra-segment pairwise comparisons, aggregating information across all segments for a more robust model of the score distribution. Empirical evaluation on the WMT'24 shared task shows that PDP agrees more closely with human judgments, ranks sentinel metrics more accurately, and better fits human-annotated error weights than prior approaches. PDP is also robust under diverse perturbations, including random noise, segment-level bias, and system-level bias, though it remains sensitive to extreme outliers. The work introduces difference-driven correlation modeling into MT meta-evaluation, a step toward more robust and interpretable segment-level metric assessment.
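The core idea above can be illustrated with a minimal sketch: pool the intra-segment pairwise score differences across all segments, then compute a single Pearson correlation on the pooled differences. The function name `pdp` and the data layout (one list of per-system scores per segment) are illustrative assumptions; the paper's exact formulation (normalization, tie handling) may differ.

```python
from itertools import combinations

def pdp(human, metric):
    """Illustrative Pairwise Difference Pearson (PDP) sketch.

    human, metric: per-segment lists of per-system scores
    (same shape). Pools intra-segment pairwise differences
    across segments, then computes one Pearson correlation
    over the pooled difference pairs.
    """
    dh, dm = [], []
    for h_seg, m_seg in zip(human, metric):
        # All within-segment system pairs (i, j), i < j.
        for i, j in combinations(range(len(h_seg)), 2):
            dh.append(h_seg[i] - h_seg[j])
            dm.append(m_seg[i] - m_seg[j])
    n = len(dh)
    mean_h = sum(dh) / n
    mean_m = sum(dm) / n
    cov = sum((a - mean_h) * (b - mean_m) for a, b in zip(dh, dm))
    var_h = sum((a - mean_h) ** 2 for a in dh)
    var_m = sum((b - mean_m) ** 2 for b in dm)
    return cov / (var_h * var_m) ** 0.5
```

Because each difference pair stays within one segment, a constant per-segment offset in the metric's scores cancels out, which is the intuition behind PDP's robustness to segment-level bias.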

📝 Abstract
This paper introduces Pairwise Difference Pearson (PDP), a novel segment-level meta-evaluation metric for Machine Translation (MT) that addresses limitations in previous Pearson's $ρ$-based and Kendall's $τ$-based meta-evaluation approaches. PDP is a correlation-based metric that utilizes pairwise score differences rather than raw scores. It draws on information from all segments for a more robust understanding of score distributions and uses segment-wise pairwise differences to refine global Pearson correlation into intra-segment score comparisons. Analysis on the WMT'24 shared task shows that PDP properly ranks sentinel evaluation metrics and aligns better with human error weightings than previous work. Noise-injection analysis demonstrates PDP's robustness to random noise, segment bias, and system bias, while highlighting its sensitivity to extreme outliers.
Problem

Research questions and friction points this paper is trying to address.

Addressing limitations in segment-level meta-evaluation metrics for Machine Translation
Proposing correlation-based metric using pairwise differences instead of raw scores
Improving robustness to noise and alignment with human error weightings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pairwise Difference Pearson metric for MT evaluation
Uses segment-wise pairwise differences instead of raw scores
Robust to noise and aligns with human error weightings