MT-R1-Zero: Advancing LLM-based Machine Translation via R1-Zero-like Reinforcement Learning

📅 2025-04-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Machine translation (MT) poses challenges for reinforcement learning because outputs are flexibly formatted and lack reliable rule-based automatic evaluation metrics. Method: This paper pioneers the open adaptation of the R1-Zero reinforcement learning paradigm to MT, requiring neither supervised fine-tuning nor cold-start initialization. It introduces a rule-metric mixed reward mechanism that jointly leverages interpretable linguistic rules and multidimensional quality metrics, enabling large language models (LLMs) to reason and optimize translations autonomously, without explicit human judgments. The proposed framework, MT-R1-Zero, integrates zero-supervision training, multi-granularity evaluation alignment, and cross-lingual generalization strategies. Contribution/Results: MT-R1-Zero-7B-Mix achieves an average score of 62.25 on the WMT24 English-Chinese benchmark, on par with GPT-4o and Claude-3.5-Sonnet, while the 7B-Sem variant sets a new state of the art on semantic metrics. The framework also generalizes robustly to low-resource and out-of-distribution settings.
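The rule-metric mixed reward can be pictured as a rule gate composed with a metric score. The sketch below is a minimal illustration, not the paper's implementation: the `<translate>` tag format, the penalty value, and the function names are all assumptions made for the example.

```python
import re

def mixed_reward(response: str, metric_score: float) -> float:
    """Illustrative rule-metric mixed reward.

    Rule component: the model output must follow an expected format,
    here (as an assumption) a final translation wrapped in
    <translate>...</translate> tags after any reasoning text.
    Metric component: a normalized automatic quality score in [0, 1],
    e.g. from a learned MT metric, supplied by the caller.
    """
    # Rule check: malformed outputs are penalized regardless of quality,
    # so the policy first learns to produce well-formed translations.
    format_ok = bool(re.search(r"<translate>.+</translate>", response, re.DOTALL))
    if not format_ok:
        return -1.0
    # Well-formed outputs are rewarded by translation quality alone.
    return metric_score
```

A caller would extract the text inside the tags, score it with an automatic metric against the source or reference, and pass that score in; the gate plus score then serves as the scalar reward for RL training.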

📝 Abstract
Large-scale reinforcement learning (RL) methods have proven highly effective in enhancing the reasoning abilities of large language models (LLMs), particularly for tasks with verifiable solutions such as mathematics and coding. However, applying this idea to machine translation (MT), where outputs are flexibly formatted and difficult to automatically evaluate with explicit rules, remains underexplored. In this work, we introduce MT-R1-Zero, the first open-source adaptation of the R1-Zero RL framework for MT without supervised fine-tuning or cold-start. We propose a rule-metric mixed reward mechanism to guide LLMs towards improved translation quality via emergent reasoning. On the WMT 24 English-Chinese benchmark, our MT-R1-Zero-3B-Mix achieves competitive performance, surpassing TowerInstruct-7B-v0.2 by an average of 1.26 points. Meanwhile, our MT-R1-Zero-7B-Mix attains a high average score of 62.25 across all metrics, placing it on par with advanced proprietary models such as GPT-4o and Claude-3.5-Sonnet, while the MT-R1-Zero-7B-Sem variant achieves state-of-the-art scores on semantic metrics. Moreover, our work exhibits strong generalization capabilities on out-of-distribution MT tasks, robustly supporting multilingual and low-resource settings. Extensive analysis of model behavior across different initializations and reward metrics offers pioneering insight into the critical role of reward design, LLM adaptability, training dynamics, and emergent reasoning patterns within the R1-Zero paradigm for MT. Our code is available at https://github.com/fzp0424/MT-R1-Zero.
Problem

Research questions and friction points this paper is trying to address.

Enhancing machine translation via R1-Zero RL without supervised fine-tuning
Overcoming flexible output evaluation challenges in LLM-based translation
Achieving competitive performance against advanced proprietary translation models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses R1-Zero RL framework for machine translation
Implements rule-metric mixed reward mechanism
Achieves competitive performance without supervised fine-tuning