🤖 AI Summary
Existing automatic machine translation (MT) evaluation metrics achieve near-human performance on standard benchmarks but suffer from poor interpretability and weak robustness under out-of-distribution (OOD) conditions.
Method: We propose Remedy-R, a generative large language model (LLM)-based MT evaluation method that requires no error annotations. It learns directly from paired translation preferences via a reinforcement learning from human feedback (RLHF)-inspired approach, producing structured reasoning chains and scores for accuracy, fluency, and completeness.
Contribution/Results: Remedy-R pioneers an “annotation-free, non-distillation, generative reasoning” evaluation paradigm, enabling self-reflective feedback and supporting an evaluate-revise agent. Trained on only 60K samples, it remains competitive with strong baselines, including GPT-4-based judges, on WMT 2022–2024 benchmarks. It demonstrates superior cross-lingual generalization and OOD robustness, and measurably improves the translation quality of models such as Qwen2.5 and ALMA-R.
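The pairwise-preference training signal can be illustrated with a Bradley-Terry-style objective, a common choice in RLHF-style reward learning. The paper's exact loss is not reproduced here; this is a minimal sketch assuming the metric emits a scalar score per translation:

```python
import math

def preference_log_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood that the preferred
    translation outscores the rejected one: -log sigmoid(s_c - s_r).

    Minimizing this loss pushes the metric to assign a higher score
    to the human-preferred translation in each training pair.
    """
    margin = score_chosen - score_rejected
    return math.log(1.0 + math.exp(-margin))
```

The loss shrinks as the metric ranks the preferred translation further above the rejected one, and equals log 2 when the two scores tie.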
📝 Abstract
Over the years, automatic MT metrics have hill-climbed benchmarks and shown strong, sometimes human-level, agreement with human ratings. Yet they remain black boxes, offering little insight into their decision-making and often failing on real-world out-of-distribution (OOD) inputs. We introduce Remedy-R, a reasoning-driven generative MT metric trained with reinforcement learning from pairwise translation preferences, without requiring error-span annotations or distillation from closed LLMs. Remedy-R produces step-by-step analyses of accuracy, fluency, and completeness, followed by a final score, enabling more interpretable assessments. With only 60K training pairs across two language pairs, Remedy-R remains competitive with top scalar metrics and GPT-4-based judges on WMT22-24 meta-evaluation, generalizes to other languages, and exhibits strong robustness on OOD stress tests. Moreover, Remedy-R models generate self-reflective feedback that can be reused for translation improvement. Building on this finding, we introduce Remedy-R Agent, a simple evaluate-revise pipeline that leverages Remedy-R's evaluation analysis to refine translations. This agent consistently improves translation quality across diverse models, including Qwen2.5, ALMA-R, GPT-4o-mini, and Gemini-2.0-Flash, suggesting that Remedy-R's reasoning captures translation-relevant information and is practically useful.
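The evaluate-revise pipeline described above can be sketched as a simple loop. The function names, score threshold, and stopping rule below are illustrative assumptions, not the paper's implementation; the evaluator and reviser are passed in as callables standing in for Remedy-R and a translation model:

```python
from typing import Callable, Tuple

def evaluate_revise(
    source: str,
    translation: str,
    evaluate: Callable[[str, str], Tuple[float, str]],  # stand-in for Remedy-R: (score, feedback)
    revise: Callable[[str, str, str], str],             # stand-in for the translator's revision step
    threshold: float = 90.0,
    max_rounds: int = 3,
) -> str:
    """Iteratively score a translation and revise it using the
    evaluator's feedback, stopping once the score clears the
    threshold or the round budget is exhausted."""
    for _ in range(max_rounds):
        score, feedback = evaluate(source, translation)
        if score >= threshold:
            break
        translation = revise(source, translation, feedback)
    return translation
```

The design mirrors the paper's observation that the metric's reasoning is reusable: the feedback string, not just the score, is fed back into the reviser.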