RL from Teacher-Model Refinement: Gradual Imitation Learning for Machine Translation

📅 2025-07-29
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Existing preference-learning methods (e.g., DPO) rely on large-scale static preference triples and suffer from poor generalization and weak cross-domain transferability. This paper proposes RLfR, a reinforcement learning framework for translation that replaces static preference modeling with dynamic, fine-grained feedback from an external teacher model (e.g., GPT-4o), framing translation as progressive micro-teaching and imitation learning. RLfR trains the actor with a dual-signal reward combining normalized negative edit distance and COMET score, enabling real-time hypothesis refinement without human-annotated triples and mirroring incremental human learning. Evaluated on the FLORES-200 multilingual benchmark, RLfR significantly outperforms MT-SFT and leading preference-based baselines, with consistent improvements in COMET (semantic adequacy) and M-ETA (entity retention), demonstrating the effectiveness and generalizability of the dynamic teacher-driven paradigm.

📝 Abstract
Preference-learning methods for machine translation (MT)--such as Direct Preference Optimization (DPO)--have achieved impressive gains but depend heavily on large, carefully curated triplet datasets and often struggle to generalize beyond their tuning domains. We propose Reinforcement Learning from Teacher-Model Refinement (RLfR), a novel framework that removes reliance on static triplets by leveraging continuous, high-quality feedback from an external teacher model (GPT-4o). RLfR frames each translation step as a micro-tutorial: the actor generates a hypothesis, the teacher refines it, and the actor is rewarded based on how closely it aligns with the teacher's refinement. Guided by two complementary signals--(i) negative edit distance, promoting lexical and structural fidelity, and (ii) COMET score, ensuring semantic adequacy--the actor progressively learns to emulate the teacher, mirroring a human learning process through incremental, iterative improvement. On the FLORES-200 benchmark (English to and from German, Spanish, Chinese, Korean, and Japanese), RLfR consistently outperforms both MT-SFT and preference-based baselines, significantly improving COMET (semantic adequacy) and M-ETA (entity preservation) scores.
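The dual-signal reward described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the normalization choice, the mixing weight `alpha`, and the `comet_model` callable (standing in for a real COMET scorer such as one loaded via Unbabel's `comet` library) are all assumptions.

```python
# Hedged sketch of an RLfR-style dual-signal reward: normalized negative
# edit distance (lexical/structural fidelity to the teacher's refinement)
# plus a semantic adequacy score in place of COMET.

def levenshtein(a: str, b: str) -> int:
    """Token-level edit distance via standard dynamic programming."""
    ta, tb = a.split(), b.split()
    prev = list(range(len(tb) + 1))
    for i, x in enumerate(ta, 1):
        cur = [i]
        for j, y in enumerate(tb, 1):
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def edit_reward(hypothesis: str, refinement: str) -> float:
    """Negative edit distance normalized by length; 0.0 means identical."""
    n = max(len(hypothesis.split()), len(refinement.split()), 1)
    return -levenshtein(hypothesis, refinement) / n

def rlfr_reward(hypothesis: str, refinement: str, source: str,
                comet_model, alpha: float = 0.5) -> float:
    """Combine both signals; alpha is a hypothetical mixing weight,
    and comet_model is any (source, hypothesis, reference) -> float scorer."""
    semantic = comet_model(source, hypothesis, refinement)
    return alpha * edit_reward(hypothesis, refinement) + (1 - alpha) * semantic
```

An identical hypothesis and refinement yields an edit reward of exactly 0, so the total reward is driven entirely by the semantic term, which matches the intuition that the actor is rewarded for converging on the teacher's output.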
Problem

Research questions and friction points this paper is trying to address.

Reducing reliance on static triplet datasets for machine translation preference learning
Improving generalization beyond tuning domains in machine translation
Enhancing semantic adequacy and entity preservation in translations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages continuous GPT-4o teacher feedback
Combines negative edit distance and COMET
Micro-tutorial actor-teacher refinement process
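The micro-tutorial process listed above can be sketched as a single training step. Everything here is an assumed interface, not the paper's API: `actor.generate`, `teacher.refine`, and `reward_fn` are hypothetical names, and the policy-gradient update (e.g., PPO) is deliberately left out.

```python
# Hypothetical sketch of one RLfR micro-tutorial step: the actor proposes a
# translation, the teacher (e.g., GPT-4o) refines it, and the reward measures
# how closely the hypothesis matches the refinement.

def micro_tutorial_step(source, actor, teacher, reward_fn):
    hypothesis = actor.generate(source)               # actor's draft translation
    refinement = teacher.refine(source, hypothesis)   # teacher's improved version
    reward = reward_fn(hypothesis, refinement)        # dual-signal reward
    # A full implementation would apply a policy-gradient update to the actor
    # here; this sketch just returns the transition for inspection/logging.
    return hypothesis, refinement, reward
```

Iterating this step is what frames training as gradual imitation: each round, the actor is nudged toward the teacher's refinement rather than toward a fixed preference dataset.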
Dongyub Jude Lee
Zoom Communications
Zhenyi Ye
Zoom Communications
Pengcheng He
Microsoft
Machine Learning