🤖 AI Summary
This paper investigates how the adequacy–fluency trade-off in machine translation (MT) evaluation affects automatic metric performance and meta-evaluation outcomes. Through correlation analysis and a meta-evaluation framework applied to WMT data, the authors find that mainstream automatic metrics exhibit a systematic adequacy bias, and that their meta-evaluation rankings are highly sensitive to the composition of the participating MT systems, a previously under-addressed source of compositional bias. To address this, the authors propose a synthetic-system-based control method for meta-evaluation: by controllably generating diverse system combinations with calibrated adequacy–fluency profiles, the method mitigates composition-induced distortions. Experiments demonstrate that the approach significantly improves the fairness and robustness of metric rankings. This work is the first to identify and correct the adequacy–fluency trade-off bias at the meta-evaluation level, offering both theoretical insight and practical guidance for building more balanced and trustworthy MT evaluation frameworks.
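To make the adequacy-bias measurement concrete, here is a minimal sketch of the kind of correlation analysis described above, run on synthetic stand-in data. The arrays `adequacy`, `fluency`, and `metric_scores` and all weights are illustrative assumptions, not the paper's data or code:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n = 500  # number of translated segments (illustrative)

# Stand-in data: human adequacy/fluency ratings plus a metric that, by
# construction, leans toward adequacy. All weights here are assumptions.
adequacy = rng.normal(size=n)
fluency = 0.3 * adequacy + rng.normal(scale=0.9, size=n)  # partially correlated
metric_scores = 0.8 * adequacy + 0.2 * fluency + rng.normal(scale=0.3, size=n)

# Correlate the metric with each human dimension separately.
r_adequacy, _ = pearsonr(metric_scores, adequacy)
r_fluency, _ = pearsonr(metric_scores, fluency)

# A positive gap marks an adequacy-leaning metric under the paper's framing.
print(f"corr with adequacy:  {r_adequacy:.3f}")
print(f"corr with fluency:   {r_fluency:.3f}")
print(f"adequacy lean (gap): {r_adequacy - r_fluency:.3f}")
```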
📝 Abstract
We investigate the trade-off between adequacy and fluency in machine translation. We show the severity of this trade-off at the evaluation level and analyze where popular metrics fall within it. We find that current metrics generally lean toward adequacy: their scores correlate more strongly with the adequacy of translations than with their fluency. More importantly, we find that this trade-off also persists at the meta-evaluation level, and that the standard WMT meta-evaluation favors adequacy-oriented metrics over fluency-oriented ones. We show that this bias can be partly attributed to the composition of the systems included in the meta-evaluation datasets. To control for this bias, we propose a method that synthesizes translation systems for meta-evaluation. Our findings highlight the importance of understanding this trade-off in meta-evaluation and its impact on metric rankings.
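The sketch below illustrates, under stated assumptions rather than as the paper's actual method, how system composition can flip a meta-evaluation ranking: two synthetic pools share the same human preference model but differ in whether systems vary mainly in adequacy or mainly in fluency, and the winning metric changes accordingly. All function names, weights, and spreads are hypothetical:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)

def synthesize_pool(n_systems, adequacy_spread, fluency_spread):
    """Synthetic system pool: each system has an adequacy and a fluency level.

    Human quality is modeled as an equal mix of both; only the pool's
    composition (which dimension systems actually differ on) changes.
    """
    adequacy = rng.normal(0.5, adequacy_spread, n_systems)
    fluency = rng.normal(0.5, fluency_spread, n_systems)
    human = 0.5 * adequacy + 0.5 * fluency
    return adequacy, fluency, human

def meta_eval(metric_lean, adequacy, fluency, human):
    """System-level meta-evaluation: metric-human Pearson correlation."""
    metric = metric_lean * adequacy + (1 - metric_lean) * fluency
    return pearsonr(metric, human)[0]

pools = {
    "systems differ mainly in adequacy": (0.30, 0.05),
    "systems differ mainly in fluency":  (0.05, 0.30),
}
for label, (a_spread, f_spread) in pools.items():
    adequacy, fluency, human = synthesize_pool(30, a_spread, f_spread)
    r_adeq = meta_eval(0.9, adequacy, fluency, human)  # adequacy-oriented metric
    r_flu = meta_eval(0.1, adequacy, fluency, human)   # fluency-oriented metric
    winner = "adequacy-oriented" if r_adeq > r_flu else "fluency-oriented"
    print(f"{label}: winner = {winner} ({r_adeq:.2f} vs {r_flu:.2f})")
```

In the first pool the adequacy-oriented metric attains the higher system-level correlation, and in the second the fluency-oriented one does, even though human preferences never change, mirroring the compositional bias the abstract describes and motivating the synthesis of controlled system pools.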