PEGRL: Improving Machine Translation by Post-Editing Guided Reinforcement Learning

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges faced by large language models in reinforcement learning (RL)–based machine translation, particularly the high variance of Monte Carlo return estimates and insufficient local optimization due to the vast trajectory space. To overcome these limitations, the authors propose PEGRL, a two-stage RL framework that incorporates post-editing as an auxiliary task. In each iteration, PEGRL constructs post-editing inputs from the current translation outputs and employs a conditional return estimation mechanism to jointly support global exploration and fine-grained local optimization. A task-specific weighting strategy further balances the translation and post-editing objectives. Experimental results show that PEGRL consistently outperforms existing RL baselines on English→Finnish, English→Turkish, and English↔Chinese translation tasks, with English→Turkish performance on the COMET-KIWI metric comparable to that of the advanced LLM-based system DeepSeek-V3.2, supporting the method's sample efficiency and optimization stability.

📝 Abstract
Reinforcement learning (RL) has shown strong promise for LLM-based machine translation, with recent methods such as GRPO demonstrating notable gains; nevertheless, translation-oriented RL remains challenged by noisy learning signals arising from Monte Carlo return estimation, as well as a large trajectory space that favors global exploration over fine-grained local optimization. We introduce **PEGRL**, a *two-stage* RL framework that uses post-editing as an auxiliary task to stabilize training and guide overall optimization. At each iteration, translation outputs are sampled to construct post-editing inputs, allowing return estimation in the post-editing stage to benefit from conditioning on the current translation behavior, while jointly supporting both global exploration and fine-grained local optimization. A task-specific weighting scheme further balances the contributions of translation and post-editing objectives, yielding a biased yet more sample-efficient estimator. Experiments on English→Finnish, English→Turkish, and English↔Chinese show consistent gains over RL baselines, and for English→Turkish, performance on COMET-KIWI is comparable to advanced LLM-based systems (DeepSeek-V3.2).
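To make the weighting idea concrete, here is a minimal sketch of how a two-task policy-gradient objective of this shape might look, assuming GRPO-style group-normalized return estimates. All function names, the weight `lam`, and the reward inputs are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def group_advantages(rewards):
    """GRPO-style return estimate: normalize each sampled rollout's
    reward against the mean/std of its rollout group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def weighted_two_task_loss(trans_rewards, pe_rewards,
                           trans_logps, pe_logps, lam=0.5):
    """Combine translation and post-editing policy-gradient terms.

    trans_rewards / pe_rewards: per-rollout scalar rewards for the
    translation stage and the post-editing stage (the latter computed
    conditioned on the current translations).
    trans_logps / pe_logps: per-rollout sequence log-probabilities.
    lam: hypothetical task-specific weight balancing the two objectives.
    """
    a_tr = group_advantages(trans_rewards)
    a_pe = group_advantages(pe_rewards)
    loss_tr = -np.mean(a_tr * np.asarray(trans_logps, dtype=float))
    loss_pe = -np.mean(a_pe * np.asarray(pe_logps, dtype=float))
    return (1.0 - lam) * loss_tr + lam * loss_pe
```

The sketch only shows the scalar objective; in a real trainer the log-probabilities would be differentiable model outputs, and the post-editing rewards would come from scoring edits of the sampled translations.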
Problem

Research questions and friction points this paper is trying to address.

reinforcement learning
machine translation
noisy learning signals
trajectory space
local optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Post-Editing Guided RL
Two-stage Reinforcement Learning
Machine Translation
Sample-Efficient Estimation
Return Estimation