MetricX-25 and GemSpanEval: Google Translate Submissions to the WMT25 Evaluation Shared Task

📅 2025-10-28
📈 Citations: 1
Influential: 1
📄 PDF
🤖 AI Summary
This work addresses the WMT25 translation evaluation task, targeting two subtasks: quality score prediction and error span detection. We propose a dual-system approach: MetricX-25, an encoder-only regression model fine-tuned from the multilingual Gemma-3 foundation model to jointly predict MQM and ESA scores; and GemSpanEval, a generative decoder-based model that frames error detection as a structured text generation task, explicitly outputting error spans, categories, and severity levels. Both models are trained exclusively on publicly available WMT data. Experimental results show that MetricX-25 achieves significant improvements over prior state-of-the-art models in correlation with human judgments. GemSpanEval matches the strong baseline xCOMET on error span detection and, for the first time, enables end-to-end generation of fine-grained error contexts—including precise token-level spans and associated metadata—thereby enhancing interpretability and practical utility for human-in-the-loop evaluation.

Technology Category

Application Category

📝 Abstract
In this paper, we present our submissions to the unified WMT25 Translation Evaluation Shared Task. For the Quality Score Prediction subtask, we create a new generation of MetricX with improvements in the input format and the training protocol, while for the Error Span Detection subtask we develop a new model, GemSpanEval, trained to predict error spans along with their severities and categories. Both systems are based on the state-of-the-art multilingual open-weights model Gemma 3, fine-tuned on publicly available WMT data. We demonstrate that MetricX-25, adapting Gemma 3 to an encoder-only architecture with a regression head on top, can be trained to effectively predict both MQM and ESA quality scores, and significantly outperforms its predecessor. Our decoder-only GemSpanEval model, on the other hand, we show to be competitive in error span detection with xCOMET, a strong encoder-only sequence-tagging baseline. With error span detection formulated as a generative task, we instruct the model to also output the context for each predicted error span, thus ensuring that error spans are identified unambiguously.
Problem

Research questions and friction points this paper is trying to address.

Develop MetricX-25 for quality score prediction in translation
Create GemSpanEval for detecting error spans in translations
Fine-tune Gemma 3 models for WMT25 evaluation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

MetricX-25 uses encoder-only Gemma 3 with regression head
GemSpanEval employs decoder-only Gemma 3 for error detection
Both systems fine-tune Gemma 3 on WMT data
🔎 Similar Papers
No similar papers found.