🤖 AI Summary
Existing preference-learning methods (e.g., DPO) rely on large-scale static preference triples and suffer from poor generalization and weak cross-domain transferability. This paper proposes RLfR, a reinforcement learning framework for translation that replaces static preference modeling with dynamic, fine-grained feedback from an external teacher model (e.g., GPT-4o), framing translation as progressive micro-teaching and imitation learning. RLfR jointly optimizes a dual-signal reward combining a normalized negative edit distance and a COMET score to enable real-time hypothesis refinement, eliminating the need for human-annotated triples and better mirroring humans' incremental learning process. Evaluated on the FLORES-200 multilingual benchmark, RLfR significantly outperforms MT-SFT and leading preference-based baselines, achieving consistent improvements in COMET (semantic adequacy) and M-ETA (entity retention) and demonstrating the effectiveness and generalizability of the dynamic teacher-driven paradigm.
📝 Abstract
Preference-learning methods for machine translation (MT), such as Direct Preference Optimization (DPO), have achieved impressive gains but depend heavily on large, carefully curated triplet datasets and often struggle to generalize beyond their tuning domains. We propose Reinforcement Learning from Teacher-Model Refinement (RLfR), a novel framework that removes reliance on static triplets by leveraging continuous, high-quality feedback from an external teacher model (GPT-4o). RLfR frames each translation step as a micro-tutorial: the actor generates a hypothesis, the teacher refines it, and the actor is rewarded based on how closely it aligns with the teacher's refinement. Guided by two complementary signals, (i) negative edit distance, promoting lexical and structural fidelity, and (ii) COMET score, ensuring semantic adequacy, the actor progressively learns to emulate the teacher, mirroring a human learning process through incremental, iterative improvement. On the FLORES-200 benchmark (English to and from German, Spanish, Chinese, Korean, and Japanese), RLfR consistently outperforms both MT-SFT and preference-based baselines, significantly improving COMET (semantic adequacy) and M-ETA (entity preservation) scores.
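The dual-signal reward described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the token-level edit distance, the normalization by the longer sequence, the weighting `alpha`, and the function names are all assumptions, and the COMET score is treated as an externally supplied value in [0, 1] rather than computed here.

```python
# Illustrative sketch of RLfR's dual-signal reward (names and weighting are
# assumptions, not taken from the paper).

def edit_distance(a: str, b: str) -> int:
    """Token-level Levenshtein distance via the classic one-row DP."""
    ta, tb = a.split(), b.split()
    dp = list(range(len(tb) + 1))
    for i, x in enumerate(ta, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(tb, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # delete a hypothesis token
                                     dp[j - 1] + 1,    # insert a reference token
                                     prev + (x != y))  # substitute (free if equal)
    return dp[len(tb)]

def rlfr_reward(hypothesis: str, teacher_refinement: str,
                comet_score: float, alpha: float = 0.5) -> float:
    """Combine lexical fidelity and semantic adequacy into one scalar reward.

    Normalized negative edit distance lies in [-1, 0]; 0 means the actor's
    hypothesis already matches the teacher's refinement exactly.
    `comet_score` is assumed to come from an external COMET model.
    `alpha` balances the two signals (an illustrative choice).
    """
    n = max(len(hypothesis.split()), len(teacher_refinement.split()), 1)
    neg_edit = -edit_distance(hypothesis, teacher_refinement) / n
    return alpha * neg_edit + (1.0 - alpha) * comet_score
```

Under this sketch, a hypothesis identical to the teacher's refinement with a COMET score of 1.0 gets the maximum reward (0.5 with `alpha=0.5`), while every token edit the teacher had to make pushes the reward down, so the actor is pulled toward both surface and semantic agreement with the teacher.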