Sentence-level Aggregation of Lexical Metrics Correlate Stronger with Human Judgements than Corpus-level Aggregation

📅 2024-07-03
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing lexical evaluation metrics (e.g., BLEU, chrF) show low correlation with human judgments and poor stability in machine translation (MT) evaluation, particularly for low-resource languages. Method: the paper systematically investigates the impact of aggregation strategies, comparing corpus-level aggregation against averaging of individual segment-level scores. Through Kendall τ correlation analysis, cross-metric comparison (BLEU/chrF/COMET/BLEURT), and empirical statistical validation, it characterizes the fundamental distinction between the "average of ratios" and the "ratio of averages" and the implications of that distinction for statistical robustness and correlation with human judgments. Contribution/Results: sentence-level aggregation boosts the correlation of BLEU/chrF with human scores by 30–50%, markedly improving reliability in low-resource settings. It also aligns the behavior of lexical metrics more closely with that of neural metrics, establishing a lightweight, trustworthy paradigm for automatic MT evaluation.
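
To make the two aggregation strategies concrete, here is a minimal sketch using the sacrebleu library. The hypothesis/reference pairs are invented for illustration; this is not the paper's own evaluation code, only a demonstration of corpus-level versus sentence-level BLEU aggregation.

```python
# Sketch: corpus-level vs. sentence-level aggregation of BLEU
# (pip install sacrebleu). The example data below is invented.
import sacrebleu

hypotheses = [
    "the cat sat on the mat",
    "a quick brown fox jumps over the dog",
]
references = [
    "the cat sat on the mat",
    "the quick brown fox jumps over the lazy dog",
]

# Corpus-level aggregation: n-gram counts are pooled over all
# segments before the precision ratios are computed
# ("ratio of averages").
corpus = sacrebleu.corpus_bleu(hypotheses, [references])

# Sentence-level aggregation: BLEU is computed per segment and the
# resulting scores are averaged ("average of ratios").
segment_scores = [
    sacrebleu.sentence_bleu(hyp, [ref]).score
    for hyp, ref in zip(hypotheses, references)
]
sentence_level = sum(segment_scores) / len(segment_scores)

print(f"corpus-level BLEU:   {corpus.score:.2f}")
print(f"sentence-level BLEU: {sentence_level:.2f}")
```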

📝 Abstract
In this paper we show that corpus-level aggregation considerably hinders the capability of lexical metrics to accurately evaluate machine translation (MT) systems. With empirical experiments we demonstrate that averaging individual segment-level scores can make metrics such as BLEU and chrF correlate much more strongly with human judgements and behave considerably more similarly to neural metrics such as COMET and BLEURT. We show that this difference exists because corpus- and segment-level aggregation differ considerably owing to the classical "average of ratios" versus "ratio of averages" mathematical problem. Moreover, as we also show, this difference considerably affects the statistical robustness of corpus-level aggregation. Considering that neural metrics currently cover only a small set of sufficiently-resourced languages, the results in this paper can help make the evaluation of MT systems for low-resource languages more trustworthy.
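
The "average of ratios" versus "ratio of averages" distinction is easy to see numerically: sentence-level aggregation averages per-segment precision ratios, while corpus-level aggregation pools the counts before dividing, so long segments dominate the result. A toy sketch with invented per-segment match counts:

```python
# Invented n-gram match counts for two segments of very different length.
matches = [9, 2]    # matched n-grams per segment
totals  = [10, 40]  # candidate n-grams per segment

# Average of ratios (sentence-level aggregation): each segment
# contributes equally, regardless of its length.
avg_of_ratios = sum(m / t for m, t in zip(matches, totals)) / len(matches)

# Ratio of averages (corpus-level aggregation): counts are pooled
# first, so the long second segment dominates.
ratio_of_avgs = sum(matches) / sum(totals)

print(avg_of_ratios)  # 0.475
print(ratio_of_avgs)  # 0.22
```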
Problem

Research questions and friction points this paper is trying to address.

Machine Translation Evaluation
Accuracy and Stability
Low-resource Languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Machine Translation Evaluation
Sentence-level Aggregation
Resource-poor Languages