🤖 AI Summary
This work addresses the challenge of non-monotonic word dependencies in real-time speech-to-speech translation, a setting where existing approaches rely on scarce word-level alignment data or language-specific heuristics, limiting their scalability to multilingual scenarios. The authors propose a fully end-to-end framework that eliminates the need for word-level alignments: a high-latency speech translation model is first pretrained on sentence-level aligned data, then its latency policy is optimized via GRPO reinforcement learning. By dispensing with word-level supervision entirely, the approach substantially simplifies training and enables seamless cross-lingual transfer, adapting to new languages with fewer than 1,000 hours of speech data. Evaluated on five X-to-English tasks, the model achieves state-of-the-art performance in translation quality, latency, voice preservation, and naturalness, with models, code, and a 45-hour multilingual evaluation benchmark publicly released.
📝 Abstract
Simultaneous speech translation requires translating source speech into a target language in real time while handling non-monotonic word dependencies. Traditional approaches rely on supervised training with word-level aligned data, which is difficult to collect at scale and therefore depends on synthetic alignments built with suboptimal language-specific heuristics. We propose Hibiki-Zero, which eliminates the need for word-level alignments entirely. This fundamentally simplifies the training pipeline and enables seamless scaling to diverse languages with varying grammatical structures, removing the bottleneck of designing language-specific alignment heuristics. We first train on sentence-level aligned data to learn speech translation at high latency, then apply a novel reinforcement learning strategy based on GRPO to reduce latency while preserving translation quality. Hibiki-Zero achieves state-of-the-art performance in translation accuracy, latency, voice transfer, and naturalness across five X-to-English tasks. Moreover, we demonstrate that our model can be adapted to a new input language with fewer than 1,000 hours of speech. We provide examples, model weights, and inference code, and we release a benchmark containing 45 hours of multilingual data for speech translation evaluation.
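The GRPO step described above can be illustrated by its core mechanism: for each source utterance, a group of candidate outputs is sampled, each is scored with a reward, and each sample's advantage is its reward normalized against the group's mean and standard deviation. A minimal sketch follows; the combined quality-minus-latency reward and the weight `lam` are hypothetical illustrations, not the paper's actual reward design.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO advantage: normalize each sampled trajectory's reward
    against the mean and standard deviation of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def reward(quality, latency_s, lam=0.1):
    """Hypothetical reward trading translation quality against
    latency (seconds); the paper's exact reward is not given here."""
    return quality - lam * latency_s

# One group of 4 sampled translations for the same source utterance:
# (quality score, latency in seconds).
samples = [(0.82, 2.1), (0.79, 1.4), (0.85, 3.0), (0.70, 1.0)]
rewards = [reward(q, l) for q, l in samples]
advs = group_relative_advantages(rewards)
# Samples with above-average reward (good quality at low latency)
# receive positive advantages and are reinforced.
```

Because advantages are computed relative to the group rather than to a learned critic, this objective needs no value network, which is part of what makes GRPO attractive for fine-tuning large pretrained models.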