🤖 AI Summary
This work addresses the unified WMT25 Translation Evaluation Shared Task, targeting two subtasks: quality score prediction and error span detection. It proposes two systems, both built on the multilingual open-weights model Gemma 3 and fine-tuned on publicly available WMT data. MetricX-25 adapts Gemma 3 to an encoder-only architecture with a regression head and is trained to predict both MQM and ESA quality scores, with changes to the input format and the training protocol relative to earlier MetricX versions. GemSpanEval is a decoder-only model that frames error span detection as a generative task, outputting each error span together with its severity and category. According to the authors, MetricX-25 significantly outperforms its predecessor, and GemSpanEval is competitive with xCOMET, a strong encoder-only sequence-tagging baseline. Because GemSpanEval is also instructed to output the context of each predicted span, the spans it reports can be identified unambiguously in the translation.
📝 Abstract
In this paper, we present our submissions to the unified WMT25 Translation Evaluation Shared Task. For the Quality Score Prediction subtask, we create a new generation of MetricX with improvements in the input format and the training protocol, while for the Error Span Detection subtask we develop a new model, GemSpanEval, trained to predict error spans along with their severities and categories. Both systems are based on the state-of-the-art multilingual open-weights model Gemma 3, fine-tuned on publicly available WMT data. We demonstrate that MetricX-25, which adapts Gemma 3 to an encoder-only architecture with a regression head on top, can be trained to effectively predict both MQM and ESA quality scores, and that it significantly outperforms its predecessor. We also show that our decoder-only GemSpanEval model is competitive in error span detection with xCOMET, a strong encoder-only sequence-tagging baseline. With error span detection formulated as a generative task, we instruct the model to also output the context for each predicted error span, thus ensuring that error spans are identified unambiguously.
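To illustrate why outputting a context alongside each span helps, the following minimal sketch resolves a predicted span to character offsets when the span text occurs more than once in the translation. The JSON layout, the field names (`span`, `context`, `severity`, `category`), and the helper `locate_span` are assumptions made for this example only; the abstract does not specify GemSpanEval's actual output format.

```python
import json


def locate_span(translation: str, span: str, context: str) -> tuple[int, int]:
    """Return (start, end) character offsets of `span` inside `translation`.

    `context` is a longer substring that contains the span; it is used to
    select the intended occurrence when the span text is repeated.
    """
    ctx_start = translation.find(context)
    if ctx_start == -1:
        raise ValueError("context not found in translation")
    if translation.find(context, ctx_start + 1) != -1:
        raise ValueError("context occurs more than once; span is ambiguous")
    offset = context.find(span)  # first occurrence of the span inside the context
    if offset == -1:
        raise ValueError("span not found inside its context")
    start = ctx_start + offset
    return start, start + len(span)


# Hypothetical example: "the bank" appears twice in the translation.
translation = "He said the bank was closed, so he walked along the bank."

# Hypothetical model output (format assumed for illustration).
model_output = json.dumps([
    {
        "span": "the bank",
        "context": "along the bank",
        "severity": "major",
        "category": "accuracy/mistranslation",
    }
])

for error in json.loads(model_output):
    start, end = locate_span(translation, error["span"], error["context"])
    print(error["severity"], error["category"], repr(translation[start:end]), start, end)
```

Without the context, a plain string search for `"the bank"` would return the first occurrence, which is not the one the hypothetical prediction refers to.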