🤖 AI Summary
This report presents the preliminary ranking of the WMT25 General Machine Translation Shared Task, computed entirely with automatic metrics. Such a ranking may be biased in favor of systems that employ output-selection techniques such as quality estimation (QE)-based re-ranking or minimum Bayes risk (MBR) decoding, since these techniques steer candidate selection toward metric-like objectives and can inflate automatic scores without a matching gain in actual translation quality. The official WMT25 ranking will instead be based on human evaluation, which is more reliable and will supersede the automatic one. The report is meant as an early snapshot for task participants to consult while preparing their system submission papers, not as the final findings of the General MT task.
📝 Abstract
We present the preliminary ranking of the WMT25 General Machine Translation Shared Task, in which MT systems have been evaluated using automatic metrics. Because this ranking is based on automatic evaluations, it may be biased in favor of systems that employ re-ranking techniques such as Quality Estimation re-ranking or Minimum Bayes Risk decoding. The official WMT25 ranking will be based on human evaluation, which is more reliable and will supersede the automatic ranking. The purpose of this report is not to present the final findings of the General MT task, but to share preliminary results that participants may find useful when preparing their system submission papers.
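To make the suspected bias concrete, here is a minimal Python sketch of the two output-selection strategies named above. The function names and the `qe_score` / `utility` callables are hypothetical stand-ins for a real reference-free QE model and a reference-based utility metric; this illustrates the general technique, not the implementation used by any submitted system.

```python
from typing import Callable, List

def qe_rerank(source: str, candidates: List[str],
              qe_score: Callable[[str, str], float]) -> str:
    """QE re-ranking: return the candidate that a reference-free
    quality-estimation model scores highest for the given source."""
    return max(candidates, key=lambda hyp: qe_score(source, hyp))

def mbr_decode(candidates: List[str],
               utility: Callable[[str, str], float]) -> str:
    """Minimum Bayes risk decoding: return the candidate with the
    highest average utility against the other candidates, which serve
    as pseudo-references approximating the model's output distribution."""
    def expected_utility(hyp: str) -> float:
        others = [c for c in candidates if c is not hyp]
        return sum(utility(hyp, ref) for ref in others) / max(len(others), 1)
    return max(candidates, key=expected_utility)
```

If `qe_score` or `utility` correlates strongly with the automatic metric used for evaluation, the selected hypothesis has effectively been optimized toward that metric, so its automatic score can overstate its quality relative to human judgment. This is why the preliminary ranking may favor such systems and why the human evaluation supersedes it.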