🤖 AI Summary
Traditional simultaneous machine translation (SiMT) struggles to balance translation quality and low latency under strict real-time constraints: incremental READ/WRITE policies alone face an inherent trade-off between semantic fidelity and responsiveness. This paper proposes a human-like simultaneous interpreting framework that extends the SiMT action space with four adaptive actions (SENTENCE_CUT, DROP, PARTIAL_SUMMARIZATION, and PRONOMINALIZATION), enabling joint control of semantic compression and output timing. Built on a decoder-only large language model, the framework uses action-aware prompting to construct training references and a latency-aware text-to-speech (TTS) pipeline for evaluation. On the ACL60/60 English-Chinese and English-German benchmarks, the method improves COMET-KIWI scores while reducing Average Lagging, outperforming reference translations and salami-based baselines.
📝 Abstract
Simultaneous Machine Translation (SiMT) requires high-quality translations under strict real-time constraints, which traditional encoder-decoder policies with only READ/WRITE actions cannot fully address. We extend the action space of SiMT with four adaptive actions: SENTENCE_CUT, DROP, PARTIAL_SUMMARIZATION, and PRONOMINALIZATION, which enable real-time restructuring, omission, and simplification while preserving semantic fidelity. We implement these actions in a decoder-only large language model (LLM) framework and construct training references through action-aware prompting. To evaluate both quality and latency, we further develop a latency-aware TTS pipeline that maps textual outputs to speech with realistic timing. Experiments on the ACL60/60 English-Chinese and English-German benchmarks show that our framework consistently improves semantic metrics (e.g., COMET-KIWI) and achieves lower delay (measured by Average Lagging) compared to reference translations and salami-based baselines. Notably, combining DROP and SENTENCE_CUT yields the best overall balance between fluency and latency. These results demonstrate that enriching the action space of LLM-based SiMT provides a promising direction for bridging the gap between human and machine interpretation.
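As a rough illustration, the enlarged action space can be sketched as an enum driving a simple control loop. The action names follow the abstract; the `toy_policy` heuristic and its `max_lag` threshold are purely hypothetical stand-ins for the decisions the paper's LLM actually learns.

```python
from enum import Enum, auto

class Action(Enum):
    # Standard incremental SiMT actions
    READ = auto()    # consume the next source token
    WRITE = auto()   # emit the next target token
    # Adaptive actions proposed in the paper (semantics paraphrased)
    SENTENCE_CUT = auto()           # split a long source sentence and translate early
    DROP = auto()                   # omit redundant or low-information content
    PARTIAL_SUMMARIZATION = auto()  # compress a span instead of translating verbatim
    PRONOMINALIZATION = auto()      # replace a long referent with a pronoun

def toy_policy(buffered_src_tokens, emitted_tgt_tokens, max_lag=5):
    """Hypothetical wait-k-style controller over the enlarged action space.

    Real decisions come from the LLM; this only shows how extra actions
    give the controller a way to trade fidelity for latency.
    """
    lag = len(buffered_src_tokens) - len(emitted_tgt_tokens)
    if lag < max_lag:
        return Action.READ                   # not enough context yet
    if lag > 2 * max_lag:
        return Action.PARTIAL_SUMMARIZATION  # falling behind: compress
    return Action.WRITE                      # keep pace with verbatim output
```

For example, with the default `max_lag=5`, a lag of 3 triggers `READ`, a lag of 11 triggers `PARTIAL_SUMMARIZATION`, and a lag of 5 triggers `WRITE`.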