🤖 AI Summary
Conventional machine translation evaluation relies on a single scalar metric (e.g., BLEU) to jointly assess semantic accuracy and naturalness, two inherently conflicting objectives, even though no theoretical foundation justifies aggregating them.
Method: We establish, for the first time, an information-theoretic lower bound proving a fundamental trade-off between accuracy and naturalness, and propose a two-dimensional evaluation paradigm that quantifies the two separately rather than collapsing them into a traditional unidimensional score. Using the submissions to the WMT24 shared task, we conduct an empirical analysis integrating multi-dimensional human judgments with automated metrics.
Contribution/Results: Our analysis empirically confirms a critical turning point: beyond a certain accuracy threshold, further accuracy improvements systematically degrade naturalness. This work provides both theoretical grounding and empirical evidence for reforming MT evaluation standards, advocating separate assessment of accuracy and naturalness rather than conflating them into a single score.
📝 Abstract
The goal of translation, be it by human or by machine, is, given some text in a source language, to produce text in a target language that simultaneously 1) preserves the meaning of the source text and 2) achieves natural expression in the target language. However, researchers in the machine translation community usually assess translations using a single score intended to capture semantic accuracy and the naturalness of the output simultaneously. In this paper, we build on recent advances in information theory to mathematically prove and empirically demonstrate that such single-score summaries do not and cannot give the complete picture of a system's true performance. Concretely, we prove that a tradeoff exists between accuracy and naturalness and demonstrate it by evaluating the submissions to the WMT24 shared task. Our findings help explain well-known empirical phenomena, such as the observation that optimizing translation systems for a specific accuracy metric (like BLEU) initially improves the system's naturalness, while "overfitting" the system to the metric can significantly degrade its naturalness. Thus, we advocate for a change in how translations are evaluated: rather than comparing systems using a single number, they should be compared on an accuracy-naturalness plane.
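To make the proposed evaluation concrete, the sketch below (not the paper's code) compares systems as points on an accuracy-naturalness plane using Pareto dominance: a system is preferable only if another system does not beat it on both axes. All system names and scores are hypothetical placeholders.

```python
# Illustrative sketch of plane-based comparison: instead of ranking MT systems
# by one number, keep (accuracy, naturalness) pairs and report the Pareto set.
# System names and scores below are invented for illustration only.

def dominates(a, b):
    """True if point a is at least as good as b on both axes, strictly better on one."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])

def pareto_front(systems):
    """Return the systems whose (accuracy, naturalness) pair no other system dominates."""
    return {
        name: scores
        for name, scores in systems.items()
        if not any(dominates(other, scores)
                   for other_name, other in systems.items() if other_name != name)
    }

# Hypothetical (accuracy, naturalness) scores, each in [0, 1].
systems = {
    "sys_a": (0.90, 0.60),  # accurate but stilted
    "sys_b": (0.80, 0.80),  # balanced
    "sys_c": (0.70, 0.90),  # fluent but less faithful
    "sys_d": (0.75, 0.70),  # worse than sys_b on both axes
}

front = pareto_front(systems)
print(sorted(front))  # sys_d drops out; the rest trade accuracy against naturalness
```

Under a single-score summary, sys_a, sys_b, and sys_c could be forced into an arbitrary total order; on the plane, all three survive as incomparable points along the tradeoff frontier, while only the dominated sys_d is ruled out.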