SeqPO-SiMT: Sequential Policy Optimization for Simultaneous Machine Translation

📅 2025-05-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Addressing the longstanding trade-off between translation quality and latency in simultaneous machine translation (SiMT), as well as insufficient modeling of multi-step decision-making, this paper proposes a sequence-level policy optimization framework tailored for SiMT. It formalizes SiMT as a sequential decision process and introduces a joint latency–quality reward that lets the model simulate and refine its own streaming translation. Unlike conventional single-step RLHF methods such as PPO and DPO, the framework directly handles the multi-step nature of SiMT. Evaluated across six En↔Zh benchmarks, it consistently improves translation quality at lower latency; on the NEWSTEST2021 En→Zh dataset it outperforms the SFT model by 1.13 COMET while reducing Average Lagging by 6.17. Despite seeing far less context than offline translation, the 7B SeqPO-SiMT model rivals the offline translations of strong LLMs such as Qwen-2.5-7B-Instruct and LLaMA-3-8B-Instruct.

📝 Abstract
We present Sequential Policy Optimization for Simultaneous Machine Translation (SeqPO-SiMT), a new policy optimization framework that defines the simultaneous machine translation (SiMT) task as a sequential decision making problem, incorporating a tailored reward to enhance translation quality while reducing latency. In contrast to popular Reinforcement Learning from Human Feedback (RLHF) methods, such as PPO and DPO, which are typically applied in single-step tasks, SeqPO-SiMT effectively tackles the multi-step SiMT task. This intuitive framework allows the SiMT LLMs to simulate and refine the SiMT process using a tailored reward. We conduct experiments on six datasets from diverse domains for En to Zh and Zh to En SiMT tasks, demonstrating that SeqPO-SiMT consistently achieves significantly higher translation quality with lower latency. In particular, SeqPO-SiMT outperforms the supervised fine-tuning (SFT) model by 1.13 points in COMET, while reducing the Average Lagging by 6.17 in the NEWSTEST2021 En to Zh dataset. While SiMT operates with far less context than offline translation, the SiMT results of SeqPO-SiMT on 7B LLM surprisingly rival the offline translation of high-performing LLMs, including Qwen-2.5-7B-Instruct and LLaMA-3-8B-Instruct.
Problem

Research questions and friction points this paper is trying to address.

The quality–latency trade-off in simultaneous machine translation
Multi-step decision-making in SiMT is poorly captured by single-step RLHF methods
Existing SiMT methods struggle to improve quality without increasing latency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Formulates SiMT as sequential decision making with sequence-level policy optimization
Tailored reward jointly improves translation quality and reduces latency
Outperforms SFT and single-step RLHF (PPO, DPO) on the multi-step SiMT task
Ting Xu
The Chinese University of Hong Kong, ByteDance
Zhichao Huang
ByteDance
Jiankai Sun
Stanford University
Shanbo Cheng
ByteDance Seed
LLMs · ML · NLP · Machine Translation · Multimodal