Preliminary Ranking of WMT25 General Machine Translation Systems

📅 2025-08-11
🤖 AI Summary
This report presents the preliminary ranking of the WMT25 General Machine Translation Shared Task, computed with automatic metrics. Because automatic evaluation can favor systems that employ re-ranking techniques, such as quality estimation (QE) based re-ranking or minimum Bayes risk (MBR) decoding, the ranking may be biased toward such systems and should be read with caution. The official WMT25 ranking will be based on human evaluation, which is more reliable and will supersede these automatic results. The report is intended to give task participants early feedback for preparing their system submission papers, not to present the final findings of the General MT task.

📝 Abstract
We present the preliminary ranking of the WMT25 General Machine Translation Shared Task, in which MT systems have been evaluated using automatic metrics. As this ranking is based on automatic evaluations, it may be biased in favor of systems that employ re-ranking techniques, such as Quality Estimation re-ranking or Minimum Bayes Risk decoding. The official WMT25 ranking will be based on human evaluation, which is more reliable and will supersede the automatic ranking. The purpose of this report is not to present the final findings of the General MT task, but rather to share preliminary results with task participants, which may be useful when preparing their system submission papers.
Problem

Research questions and friction points this paper is trying to address.

Producing a preliminary ranking of WMT25 MT systems using only automatic metrics
Automatic evaluation may be biased in favor of systems that use re-ranking techniques
The official ranking will rely on human evaluation, superseding these automatic results
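The quality-estimation re-ranking referred to above can be sketched as follows. This is a minimal illustration, not the task's actual pipeline: `qe_score` is a toy length-ratio heuristic standing in for a real reference-free QE model, and all names are illustrative.

```python
# Hedged sketch of QE-based re-ranking: generate several candidate
# translations, score each with a reference-free quality-estimation
# model, and submit the top-scoring one. `qe_score` is a stand-in
# heuristic, not a real neural QE model.

def qe_score(source: str, hypothesis: str) -> float:
    """Toy QE model: prefer hypotheses whose word count matches the source."""
    ratio = len(hypothesis.split()) / max(len(source.split()), 1)
    return 1.0 - abs(1.0 - ratio)

def qe_rerank(source: str, candidates: list[str]) -> str:
    """Return the candidate the QE model scores highest."""
    return max(candidates, key=lambda hyp: qe_score(source, hyp))

# Example: the length-matched candidate wins under this toy model.
best = qe_rerank("le chat dort",
                 ["the cat sleeps", "the cat is sleeping right now", "cat"])
print(best)  # → the cat sleeps
```

The point the report makes is that a system selected this way is optimized toward metric-like objectives, which can inflate its automatic scores relative to its human-judged quality.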
Innovation

Methods, ideas, or system contributions that make the work stand out.

Ranking submitted systems with automatic evaluation metrics
Identifying Quality Estimation re-ranking as a potential source of metric bias
Identifying Minimum Bayes Risk decoding as a potential source of metric bias
👥 Authors
Tom Kocmi (Cohere): Multilingual Evaluation, LLMs, Machine Translation
Eleftherios Avramidis (Senior Researcher, German Research Center for Artificial Intelligence (DFKI)): LLMs, Multilinguality, Machine Translation, Sign Language Processing
Rachel Bawden (Inria): Natural Language Processing, Machine Translation
O. Bojar
Konstantin Dranch
Anton Dvorkovich
Sergey Dukanov
Natalia Fedorova
Mark Fishel (Professor of NLP, University of Tartu): Natural Language Processing, Machine Translation, Low-Resource NLP
Markus Freitag
Thamme Gowda
Roman Grundkiewicz (Microsoft): Machine Translation, Human Evaluation, Grammatical Error Correction
B. Haddow
Marzena Karpinska (Senior Researcher, Microsoft): Natural Language Processing, Language Models, Evaluation
Philipp Koehn (Professor, Johns Hopkins University): Machine Translation, Natural Language Processing
Howard Lakougna
Jessica Lundin
Kenton Murray (Research Scientist, Johns Hopkins): Machine Learning, Natural Language Processing, Machine Translation, Semantics, Neural Networks
Masaaki Nagata (NTT Corporation): Machine Translation
Stefano Perrella (PhD student, Sapienza NLP; Applied Scientist Intern, Amazon): Machine Translation
Lorenzo Proietti (PhD Student, Sapienza NLP): Natural Language Processing, Deep Learning, Machine Translation Evaluation
M. Popel
Maja Popović
Parker Riley (Google Research): Natural Language Processing, Machine Translation
Mariya Shmatova
Steinþór Steingrímsson
L. Yankovskaya
Vilém Zouhar (PhD, ETH Zürich): Natural Language Processing, Quality Estimation, Machine Translation