AI Summary
Human annotations of machine translation (MT) quality are noisy and inconsistent, which undermines conventional regression-based evaluation metrics; meanwhile, large language models (LLMs) perform poorly at segment-level assessment. To address this, we reformulate MT quality estimation as a reward modeling task grounded in human preference pairs, bypassing direct regression on noisy human scores. We propose ReMedy-9B, a model that combines pairwise preference learning with LLM fine-tuning. Evaluated across 39 language pairs and 111 MT systems from WMT 2022-2024, ReMedy-9B achieves state-of-the-art performance at both the segment and system levels, substantially outperforming strong baselines including MetricX-13B, GEMBA-GPT-4, and PaLM-540B. Notably, it is better at identifying low-quality translations and is more robust in segment-level evaluation.
Abstract
A key challenge in MT evaluation is the inherent noise and inconsistency of human ratings. Regression-based neural metrics struggle with this noise, while prompting LLMs shows promise at system-level evaluation but performs poorly at the segment level. In this work, we propose ReMedy, a novel MT metric framework that reformulates translation evaluation as a reward modeling task. Instead of regressing directly on imperfect human ratings, ReMedy learns relative translation quality from pairwise preference data, resulting in more reliable evaluation. In extensive experiments across the WMT22-24 shared tasks (39 language pairs, 111 MT systems), ReMedy achieves state-of-the-art performance at both segment- and system-level evaluation. Specifically, ReMedy-9B surpasses larger WMT-winning metrics and massive closed LLMs, including MetricX-13B, XCOMET-Ensemble, GEMBA-GPT-4, PaLM-540B, and fine-tuned PaLM-2. Further analyses demonstrate that ReMedy is superior at detecting translation errors and evaluating low-quality translations.
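The core idea of learning relative quality from preference pairs can be sketched with a standard pairwise reward-modeling objective. The abstract does not specify the exact loss, so the Bradley-Terry form below (maximizing the log-sigmoid of the score margin between the preferred and rejected translation) is an assumption, shown here in plain Python for illustration:

```python
import math

def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Bradley-Terry style pairwise loss (an assumed loss form, not
    necessarily ReMedy's exact objective): -log sigmoid(r_pref - r_rej).

    The model is trained to assign a higher scalar reward to the
    translation humans preferred; the loss shrinks as the margin grows.
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(margin)), written to avoid overflow for large negative margins
    return math.log1p(math.exp(-margin)) if margin >= 0 else -margin + math.log1p(math.exp(margin))

# A tied pair gives the maximum-uncertainty loss, log 2 ~= 0.693;
# widening the margin in the correct direction drives the loss toward 0.
tied = pairwise_preference_loss(0.0, 0.0)       # ~0.693
confident = pairwise_preference_loss(3.0, 0.0)  # much smaller
```

Because only the *difference* between scores enters the loss, a consistent bias in an annotator's absolute ratings cancels out, which is the motivation for preferring pairwise learning over direct regression on those ratings.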