RIVAL: Reinforcement Learning with Iterative and Adversarial Optimization for Machine Translation

📅 2025-06-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Reinforcement Learning from Human Feedback (RLHF) suffers severe performance degradation in spoken-language subtitle translation due to significant distributional shift between the offline reward model (RM) and the online LLM policy. Method: We propose RIVAL, a novel framework that formulates translation optimization as an iterative adversarial game between the RM and the translation model. It introduces reference-free hybrid reward modeling—jointly leveraging qualitative human preference signals and quantitative metrics (e.g., BLEU)—and dynamically co-adapts both the RM and the translation model to mitigate distributional shift. Results: Evaluated on multiple spoken-language translation benchmarks, RIVAL substantially outperforms state-of-the-art methods, achieving significant improvements in translation fluency, faithfulness, and alignment with human judgments.

📝 Abstract
Large language models (LLMs) possess strong multilingual capabilities, and combining Reinforcement Learning from Human Feedback (RLHF) with translation tasks has shown great potential. However, we observe that this paradigm performs unexpectedly poorly when applied to colloquial subtitle translation tasks. In this work, we investigate this issue and find that the offline reward model (RM) gradually diverges from the online LLM due to distributional shift, ultimately leading to undesirable training outcomes. To address this, we propose RIVAL, an adversarial training framework that formulates the process as a min-max game between the RM and the LLM. RIVAL iteratively updates both models, with the RM trained to distinguish strong from weak translations (a qualitative preference reward) and the LLM trained to improve its translations to close this gap. To stabilize training and improve generalizability, we also incorporate a quantitative preference reward (e.g., BLEU) into the RM, enabling reference-free quality modeling aligned with human evaluation. Through extensive experiments, we demonstrate that the proposed adversarial training framework significantly improves upon translation baselines.
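The iterative min-max loop described in the abstract can be sketched in miniature. The snippet below is an illustrative toy, not the paper's implementation: translations are plain strings, the "BLEU" term is a word-overlap stand-in, the qualitative reward is a simple learned word set, and `ToyRewardModel`, `rival_round`, and all parameters are hypothetical names invented for this sketch.

```python
def bleu_like(candidate, reference):
    """Toy stand-in for a quantitative metric such as BLEU:
    fraction of reference words covered by the candidate."""
    cand, ref = set(candidate.split()), set(reference.split())
    return len(cand & ref) / max(len(ref), 1)

class ToyRewardModel:
    """Hybrid reward: a learned qualitative signal blended with a
    quantitative metric, co-adapted each adversarial round."""
    def __init__(self, alpha=0.5):
        self.alpha = alpha            # qualitative/quantitative mix
        self.preferred_words = set()  # crude learned preference state

    def score(self, candidate, reference):
        words = candidate.split()
        qual = len(set(words) & self.preferred_words) / max(len(words), 1)
        quant = bleu_like(candidate, reference)
        return self.alpha * qual + (1 - self.alpha) * quant

    def update(self, strong, weak):
        # RM step of the min-max game: sharpen the distinction
        # between strong and weak translations.
        self.preferred_words |= set(strong.split()) - set(weak.split())

def rival_round(candidates, reference, rm):
    """One adversarial round: the policy's best candidate is chosen
    (a best-of-n proxy for the LLM update), then the RM adapts to the
    policy's current outputs to mitigate distributional shift."""
    ranked = sorted(candidates, key=lambda c: rm.score(c, reference),
                    reverse=True)
    strong, weak = ranked[0], ranked[-1]
    rm.update(strong, weak)
    return strong
```

Running several such rounds alternates the two updates: the RM widens the strong/weak gap while the policy closes it, which is the min-max structure RIVAL formalizes.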
Problem

Research questions and friction points this paper is trying to address.

Addresses poor performance of RLHF in colloquial subtitle translation
Mitigates reward model divergence due to distributional shift
Proposes adversarial training to align RM and LLM iteratively
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adversarial training framework for translation
Iterative updates for RM and LLM
Combines qualitative and quantitative rewards
Tianjiao Li
Bilibili Inc., Shanghai, China
Mengran Yu
Bilibili Inc., Shanghai, China
Chenyu Shi
School of Computer Science, Fudan University, China
Yanjun Zhao
UIUC
Xiaojing Liu
Bilibili Inc., Shanghai, China
Qiang Zhang
Bilibili Inc., Shanghai, China
Qi Zhang
School of Computer Science, Fudan University, China
Xuanjing Huang
School of Computer Science, Fudan University, China
Jiayin Wang
Tsinghua University