🤖 AI Summary
This report presents the preliminary ranking of the WMT25 General Machine Translation Shared Task, computed entirely with automatic metrics. Such a ranking may be biased in favor of systems that employ output-selection techniques such as quality estimation (QE)-based re-ranking or minimum Bayes risk (MBR) decoding, since these techniques steer candidate selection toward metric-like objectives and can inflate automatic scores without a matching gain in actual translation quality. The official WMT25 ranking will instead be based on human evaluation, which is more reliable and will supersede the automatic one. The report is meant as an early snapshot for task participants to consult while preparing their system submission papers, not as the final findings of the General MT task.
📝 Abstract
We present the preliminary ranking of the WMT25 General Machine Translation Shared Task, in which MT systems have been evaluated using automatic metrics. Because this ranking is based on automatic evaluations, it may be biased in favor of systems that employ re-ranking techniques such as Quality Estimation re-ranking or Minimum Bayes Risk decoding. The official WMT25 ranking will be based on human evaluation, which is more reliable and will supersede the automatic ranking. The purpose of this report is not to present the final findings of the General MT task, but to share preliminary results that participants may find useful when preparing their system submission papers.
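To make the suspected bias concrete, here is a minimal Python sketch of the two output-selection strategies named above. The function names and the `qe_score` / `utility` callables are hypothetical stand-ins for a real reference-free QE model and a reference-based utility metric; this illustrates the general technique, not the implementation used by any submitted system.

```python
from typing import Callable, List

def qe_rerank(source: str, candidates: List[str],
              qe_score: Callable[[str, str], float]) -> str:
    """QE re-ranking: return the candidate that a reference-free
    quality-estimation model scores highest for the given source."""
    return max(candidates, key=lambda hyp: qe_score(source, hyp))

def mbr_decode(candidates: List[str],
               utility: Callable[[str, str], float]) -> str:
    """Minimum Bayes risk decoding: return the candidate with the
    highest average utility against the other candidates, which serve
    as pseudo-references approximating the model's output distribution."""
    def expected_utility(hyp: str) -> float:
        others = [c for c in candidates if c is not hyp]
        return sum(utility(hyp, ref) for ref in others) / max(len(others), 1)
    return max(candidates, key=expected_utility)
```

If `qe_score` or `utility` correlates strongly with the automatic metric used for evaluation, the selected hypothesis has effectively been optimized toward that metric, so its automatic score can overstate its quality relative to human judgment. This is why the preliminary ranking may favor such systems and why the human evaluation supersedes it.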