Hierarchical Policy Optimization for Simultaneous Translation of Unbounded Speech

📅 2026-04-22

🤖 AI Summary

This work addresses three obstacles in simultaneous speech translation: the scarcity of high-quality supervision data in dialogue form, the unreliable quality of synthetic data, and the high computational cost of large language models. It proposes Hierarchical Policy Optimization (HPO), a post-training approach for multi-turn simultaneous translation that starts from models trained on imperfect supervised fine-tuning data. The multi-turn formulation keeps inference cost low through KV cache reuse, and a hierarchical reward balances translation quality against latency. On English-to-Chinese, English-to-German, and English-to-Japanese translation, the method improves over existing baselines by more than 7 COMET points and 1.25 MetricX points at a latency of 1.5 seconds. Ablation studies examine the quality rewards, the hierarchical reward formulations, and the segmentation strategies.
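To make the KV cache reuse mentioned above concrete, the following minimal sketch shows why a multi-turn layout allows it: each new speech chunk is appended as a further turn, so the history at one step is an exact prefix of the history at the next. The function names, the turn roles, and the `<audio_i>` placeholders are illustrative assumptions, not the paper's implementation.

```python
def render_history(chunks, translations):
    """Interleave speech chunks (user turns) with partial translations (assistant turns).

    Hypothetical layout: chunk i is followed by translation i once it exists.
    """
    turns = []
    for i, chunk in enumerate(chunks):
        turns.append(("user", chunk))
        if i < len(translations):
            turns.append(("assistant", translations[i]))
    return turns


def reusable_prefix_len(previous, current):
    """Count the leading turns shared by two histories.

    Turns in the shared prefix are the ones whose cached keys and values
    could be reused instead of recomputed.
    """
    n = 0
    for a, b in zip(previous, current):
        if a != b:
            break
        n += 1
    return n


# After the first chunk has been translated:
step1 = render_history(["<audio_0>"], ["Hallo"])
# When the second chunk arrives, the old history is untouched and one turn is added:
step2 = render_history(["<audio_0>", "<audio_1>"], ["Hallo"])
shared = reusable_prefix_len(step1, step2)
```

Here `shared` equals `len(step1)`, so only the newly appended turn would need fresh computation. A formulation that rewrote earlier turns at each step would shorten the shared prefix and force recomputation.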

📝 Abstract

Simultaneous speech translation (SST) generates translations while receiving partial speech input. Recent advances show that large language models (LLMs) can substantially improve SST quality, but at the cost of high computational overhead. To reduce this cost, prior work reformulates SST as a multi-turn dialogue task, enabling full reuse of the LLM's key-value (KV) cache and eliminating redundant feature recomputation. However, this approach relies on supervised fine-tuning (SFT) data in dialogue form, for which few human annotations exist, and existing synthesis methods cannot guarantee data quality. In this work, we propose a Hierarchical Policy Optimization (HPO) approach that post-trains models trained on imperfect SFT data. We introduce a hierarchical reward that balances translation quality and latency objectives. Experiments on English to Chinese/German/Japanese demonstrate improvements of over +7 COMET score and +1.25 MetricX score at a latency of 1.5 seconds. Comprehensive ablation studies further validate the effectiveness of different quality rewards, hierarchical reward formulations, and segmentation strategies. Code is available at https://github.com/owaski/HPO
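The abstract does not spell out the form of the hierarchical reward. As one possible reading, a reward can be made hierarchical by gating the latency term on a quality floor, so that a policy cannot score well by emitting fast but poor translations. The sketch below is an assumption for illustration only: the function name, the gating rule, the `quality_floor` value, and the use of a normalized quality score in [0, 1] are all hypothetical, and the 1.5-second default merely echoes the latency at which the abstract reports results.

```python
def hierarchical_reward(quality, latency_s, quality_floor=0.7, latency_budget_s=1.5):
    """Two-level reward sketch (hypothetical, not the paper's formulation).

    quality: normalized translation quality in [0, 1] (assumed scale).
    latency_s: average lag of the translation in seconds.
    """
    # Level 1: below the quality floor, latency earns nothing.
    if quality < quality_floor:
        return quality
    # Level 2: once quality is acceptable, reward finishing under the budget.
    latency_bonus = max(0.0, 1.0 - latency_s / latency_budget_s)
    return quality + latency_bonus


# A fast but poor translation is not rewarded for its speed:
low = hierarchical_reward(quality=0.5, latency_s=0.1)
# A good translation at half the latency budget earns half of the bonus:
high = hierarchical_reward(quality=0.8, latency_s=0.75)
```

With this gating, `low` is just the quality score, while `high` adds a latency bonus on top of it. Other hierarchical designs are possible, and the paper's ablations compare several formulations.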

Problem

Research questions and friction points this paper is trying to address.

Simultaneous Speech Translation

Supervised Fine-Tuning

Data Quality

KV Cache Reuse

Latency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Policy Optimization

Simultaneous Speech Translation

Large Language Models

KV Cache Reuse

Latency-Quality Trade-off
