🤖 AI Summary
This study addresses cross-lingual scoring bias in automatic machine translation evaluation metrics: translations of equal quality can receive different scores depending on the language. The problem has gone largely unstudied because no parallel dataset offers consistent quality annotations across languages. To fill this gap, the authors propose XQ-MEval, which they present as the first benchmark dataset enabling cross-lingual parallel-quality evaluation. Built on the MQM error taxonomy, the dataset is constructed by automatically injecting errors into high-quality reference translations, having native speakers filter the injected errors for reliability, and merging the retained errors into pseudo-translations with controlled quality levels, yielding source–reference–pseudo-translation triplets. Experiments across nine translation directions with nine widely used metrics show that averaging metric scores across languages is inconsistent with human judgments and provide empirical evidence of cross-lingual scoring bias. The paper further introduces a score normalization strategy that aligns score distributions across languages, improving the fairness and reliability of multilingual metric evaluation.
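The construction step, injecting errors into a reference translation to obtain a pseudo-translation of known quality, can be pictured with a minimal sketch. Everything named here is hypothetical and not taken from the paper: the `InjectedError` record, the `merge_errors` helper, the character-offset representation, and the severity weights (minor = 1, major = 5, a convention commonly used in MQM scoring) are assumptions for illustration only.

```python
from dataclasses import dataclass

# Assumed severity weights (a common MQM convention); the paper's values may differ.
SEVERITY_WEIGHT = {"minor": 1, "major": 5}


@dataclass(frozen=True)
class InjectedError:
    """Hypothetical record for one error that passed native-speaker filtering."""
    start: int          # character offset in the reference (inclusive)
    end: int            # character offset in the reference (exclusive)
    replacement: str    # erroneous text that replaces reference[start:end]
    severity: str       # "minor" or "major"


def merge_errors(reference: str, errors: list[InjectedError]) -> tuple[str, int]:
    """Apply non-overlapping errors to a reference and return (pseudo_translation, penalty)."""
    text = reference
    previous_start = len(reference)
    # Apply from right to left so earlier offsets remain valid after each edit.
    for err in sorted(errors, key=lambda e: e.start, reverse=True):
        if err.end > previous_start:
            raise ValueError("error spans must not overlap")
        text = text[:err.start] + err.replacement + text[err.end:]
        previous_start = err.start
    penalty = sum(SEVERITY_WEIGHT[e.severity] for e in errors)
    return text, penalty


reference = "The cat sat on the mat."
errors = [
    InjectedError(4, 7, "dog", "major"),    # replaces "cat"
    InjectedError(19, 22, "hat", "minor"),  # replaces "mat"
]
pseudo_translation, penalty = merge_errors(reference, errors)
```

Under these assumptions, choosing which errors to merge fixes the total penalty, which is one way a pseudo-translation's quality level could be controlled identically across languages.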
📝 Abstract
Automatic evaluation metrics are essential for building multilingual translation systems. The common practice for evaluating these systems is to average metric scores across languages, yet this practice is questionable because metrics may suffer from cross-lingual scoring bias, where translations of equal quality receive different scores across languages. This problem has not been systematically studied because no benchmark provides parallel-quality instances across languages, and expert annotation is impractical. In this work, we propose XQ-MEval, a semi-automatically built dataset covering nine translation directions, to benchmark translation metrics. Specifically, we automatically inject MQM-defined errors into gold translations, have native speakers filter them for reliability, and merge errors to generate pseudo-translations with controllable quality. These pseudo-translations are then paired with their corresponding sources and references to form triplets used to assess the quality of translation metrics. Using XQ-MEval, our experiments on nine representative metrics reveal that averaging is inconsistent with human judgment and provide the first empirical evidence of cross-lingual scoring bias. Finally, we propose a normalization strategy derived from XQ-MEval that aligns score distributions across languages, improving the fairness and reliability of multilingual metric evaluation.
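To make the idea of aligning score distributions concrete, the following sketch uses per-language z-score standardization fitted on parallel-quality calibration items. This is an illustrative stand-in, not the paper's normalization strategy, whose details are not given in the abstract. The function names, the language-pair keys, and the toy scores are all hypothetical.

```python
from statistics import mean, pstdev

# Toy metric scores on parallel-quality items: the same three quality levels
# in both directions, yet the metric scores en-zh consistently lower.
CALIBRATION = {
    "en-de": [0.9, 0.8, 0.7],
    "en-zh": [0.6, 0.5, 0.4],
}


def fit_language_stats(calibration: dict[str, list[float]]) -> dict[str, tuple[float, float]]:
    """Estimate a (mean, std) pair per language from calibration scores."""
    stats = {}
    for lang, scores in calibration.items():
        sigma = pstdev(scores)
        # Fall back to 1.0 to avoid dividing by zero when all scores are equal.
        stats[lang] = (mean(scores), sigma if sigma > 0 else 1.0)
    return stats


def normalize(score: float, lang: str, stats: dict[str, tuple[float, float]]) -> float:
    """Map a raw metric score onto the language's standardized scale."""
    mu, sigma = stats[lang]
    return (score - mu) / sigma


def macro_average(system_scores: dict[str, list[float]], stats=None) -> float:
    """Average per-language means; normalize first when stats are provided."""
    per_language = []
    for lang, scores in system_scores.items():
        if stats is not None:
            scores = [normalize(s, lang, stats) for s in scores]
        per_language.append(mean(scores))
    return mean(per_language)


STATS = fit_language_stats(CALIBRATION)
# Two translations of equal (top) quality that receive different raw scores.
SYSTEM = {"en-de": [0.9], "en-zh": [0.6]}
raw_average = macro_average(SYSTEM)
normalized_average = macro_average(SYSTEM, STATS)
```

In this toy setting the two equal-quality translations have different raw scores but the same normalized score, so the averaged result no longer depends on which language the metric happens to score more generously.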