🤖 AI Summary
Existing automatic machine translation (MT) evaluation metrics achieve near-human performance on standard benchmarks but suffer from poor interpretability and weak robustness under out-of-distribution (OOD) conditions.
Method: We propose Remedy-R, a generative large language model (LLM)-based MT evaluation method that requires no error annotations. It learns directly from paired translation preferences via a reinforcement learning from human feedback (RLHF)-inspired approach, producing structured reasoning chains and scores for accuracy, fluency, and completeness.
Contribution/Results: Remedy-R pioneers an “annotation-free, non-distillation, generative reasoning” evaluation paradigm, enabling self-reflective feedback and supporting an evaluate-revise agent. Trained on only 60K samples, it remains competitive with strong baselines, including GPT-4-based judges, on WMT 2022–2024 benchmarks. It demonstrates superior cross-lingual generalization and OOD robustness, and measurably improves the translation quality of models such as Qwen2.5 and ALMA-R.
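The pairwise-preference training signal can be illustrated with a Bradley-Terry-style objective, a common choice in RLHF-style reward learning. The paper's exact loss is not reproduced here; this is a minimal sketch assuming the metric emits a scalar score per translation:

```python
import math

def preference_log_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood that the preferred
    translation outscores the rejected one: -log sigmoid(s_c - s_r).

    Minimizing this loss pushes the metric to assign a higher score
    to the human-preferred translation in each training pair.
    """
    margin = score_chosen - score_rejected
    return math.log(1.0 + math.exp(-margin))
```

The loss shrinks as the metric ranks the preferred translation further above the rejected one, and equals log 2 when the two scores tie.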
📝 Abstract
Over the years, automatic MT metrics have hill-climbed benchmarks and shown strong, sometimes human-level, agreement with human ratings. Yet they remain black boxes, offering little insight into their decision-making and often failing on real-world out-of-distribution (OOD) inputs. We introduce Remedy-R, a reasoning-driven generative MT metric trained with reinforcement learning from pairwise translation preferences, without requiring error-span annotations or distillation from closed LLMs. Remedy-R produces step-by-step analyses of accuracy, fluency, and completeness, followed by a final score, enabling more interpretable assessments. With only 60K training pairs across two language pairs, Remedy-R remains competitive with top scalar metrics and GPT-4-based judges on WMT22-24 meta-evaluation, generalizes to other languages, and exhibits strong robustness on OOD stress tests. Moreover, Remedy-R models generate self-reflective feedback that can be reused for translation improvement. Building on this finding, we introduce Remedy-R Agent, a simple evaluate-revise pipeline that leverages Remedy-R's evaluation analysis to refine translations. This agent consistently improves translation quality across diverse models, including Qwen2.5, ALMA-R, GPT-4o-mini, and Gemini-2.0-Flash, suggesting that Remedy-R's reasoning captures translation-relevant information and is practically useful.
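The evaluate-revise pipeline described above can be sketched as a simple loop. The function names, score threshold, and stopping rule below are illustrative assumptions, not the paper's implementation; the evaluator and reviser are passed in as callables standing in for Remedy-R and a translation model:

```python
from typing import Callable, Tuple

def evaluate_revise(
    source: str,
    translation: str,
    evaluate: Callable[[str, str], Tuple[float, str]],  # stand-in for Remedy-R: (score, feedback)
    revise: Callable[[str, str, str], str],             # stand-in for the translator's revision step
    threshold: float = 90.0,
    max_rounds: int = 3,
) -> str:
    """Iteratively score a translation and revise it using the
    evaluator's feedback, stopping once the score clears the
    threshold or the round budget is exhausted."""
    for _ in range(max_rounds):
        score, feedback = evaluate(source, translation)
        if score >= threshold:
            break
        translation = revise(source, translation, feedback)
    return translation
```

The design mirrors the paper's observation that the metric's reasoning is reusable: the feedback string, not just the score, is fed back into the reviser.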