SimulU: Training-free Policy for Long-form Simultaneous Speech-to-Speech Translation

📅 2026-03-11
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work proposes SimulU, a novel simultaneous speech-to-speech translation strategy that operates without any additional training and effectively handles long-form audioβ€”a longstanding challenge for existing approaches, which typically rely on extensive task-specific training and are limited to short utterances. SimulU leverages cross-attention mechanisms within pretrained end-to-end models to dynamically regulate the use of input history and output generation through a history management and speech output selection scheme. For the first time, it enables high-quality simultaneous speech-to-speech translation on long sequences without fine-tuning. Evaluated on the MuST-C benchmark across eight languages, SimulU matches or surpasses strong cascaded baselines in the trade-off between translation quality and latency, removing both the need for specialized training and the restriction to short, pre-segmented inputs.

πŸ“ Abstract
Simultaneous speech-to-speech translation (SimulS2S) is essential for real-time multilingual communication, with increasing integration into meeting and streaming platforms. Despite this, SimulS2S remains underexplored in research, where current solutions often rely on resource-intensive training procedures and operate on short-form, pre-segmented utterances, failing to generalize to continuous speech. To bridge this gap, we propose SimulU, the first training-free policy for long-form SimulS2S. SimulU adopts history management and speech output selection strategies that exploit cross-attention in pre-trained end-to-end models to regulate both input history and output generation. Evaluations on MuST-C across 8 languages show that SimulU achieves a better or comparable quality-latency trade-off against strong cascaded models. By eliminating the need for ad-hoc training, SimulU offers a promising path to end-to-end SimulS2S in realistic, long-form scenarios.
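The history-management idea described above β€” using cross-attention in a pretrained model to decide how much input history to retain β€” can be sketched in a few lines. The function name, the attention-mass threshold, and the pruning rule below are illustrative assumptions for exposition, not the paper's actual policy.

```python
import numpy as np

def trim_history(cross_attn, mass=0.95):
    """Return the earliest input frame index still worth keeping.

    cross_attn: (generated_tokens, input_frames) cross-attention weights.
    Frames before the returned index received negligible attention from
    the most recent output token, so (under this toy rule) they can be
    dropped from the input history.
    """
    last = cross_attn[-1]               # attention row of the newest token
    order = np.argsort(last)[::-1]      # frames, most-attended first
    cum = np.cumsum(last[order])        # cumulative attention mass
    k = np.searchsorted(cum, mass) + 1  # fewest frames covering `mass`
    return int(order[:k].min())         # earliest frame still needed

# Example: the newest token attends almost only to the last three frames,
# so frames 0-1 can be discarded from the history.
attn = np.array([[0.0, 0.0, 0.1, 0.4, 0.5]])
print(trim_history(attn))  # β†’ 2
```

A cumulative-mass cutoff like this is one simple way to turn soft attention weights into a hard truncation decision; the paper's scheme additionally governs speech output selection, which this sketch does not cover.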
Problem

Research questions and friction points this paper is trying to address.

Simultaneous speech-to-speech translation
long-form speech
real-time multilingual communication
continuous speech
training-intensive methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

training-free
simultaneous speech-to-speech translation
long-form
cross-attention
history management