🤖 AI Summary
This work addresses two coupled challenges in real-time speech-to-speech translation: modeling source and target speech synchronously, and adapting latency to the input. The proposed system, Hibiki, is a decoder-only multistream language model that processes source speech while jointly generating text and audio tokens, performing speech-to-text and speech-to-speech translation simultaneously. Per-word delays are learned through weak supervision: the perplexity of an off-the-shelf text translation system identifies how much source context each target word requires, yielding aligned synthetic training data without explicit word alignments or complex scheduling. At inference time, plain temperature sampling suffices, preserving speech naturalness and speaker fidelity. On a French-English simultaneous speech translation task, Hibiki achieves state-of-the-art translation quality, speaker similarity, and naturalness, and its simple inference procedure supports both batched translation and real-time on-device deployment.
📝 Abstract
We introduce Hibiki, a decoder-only model for simultaneous speech translation. Hibiki leverages a multistream language model to synchronously process source and target speech, and jointly produces text and audio tokens to perform speech-to-text and speech-to-speech translation. We furthermore address the fundamental challenge of simultaneous interpretation: unlike its consecutive counterpart, where one waits for the end of the source utterance before starting to translate, simultaneous interpretation adapts its flow to accumulate just enough context to produce a correct translation in real time, chunk by chunk. To do so, we introduce a weakly-supervised method that leverages the perplexity of an off-the-shelf text translation system to identify optimal delays on a per-word basis and create aligned synthetic data. After supervised training, Hibiki performs adaptive, simultaneous speech translation with vanilla temperature sampling. On a French-English simultaneous speech translation task, Hibiki demonstrates state-of-the-art performance in translation quality, speaker fidelity and naturalness. Moreover, the simplicity of its inference process makes it compatible with batched translation and even real-time on-device deployment. We provide examples as well as models and inference code.
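To make the delay-selection idea concrete, the following is a minimal sketch (not the paper's implementation) of using a translation model's word-level confidence to pick per-word delays: for each target word, find the shortest source prefix under which the model is confident enough to emit it. The function names, the toy `word_logprob` scorer, and the confidence threshold are all illustrative assumptions.

```python
# Hedged sketch of perplexity-based, per-word delay selection.
# word_logprob is a toy stand-in for an off-the-shelf MT model's
# log P(target_word | source_prefix, target_prefix); the real system
# would query an actual translation model.

def word_logprob(source_prefix, target_prefix, target_word):
    # Toy heuristic (assumption): roughly monotonic alignment, so the
    # model becomes confident once the aligned source word is visible.
    needed = len(target_prefix) + 1
    seen = len(source_prefix)
    return -0.1 if seen >= needed else -5.0

def pick_delays(source_words, target_words, threshold=-1.0):
    """For each target word, return how many source words must be
    heard before emitting it (its per-word delay)."""
    delays = []
    for t_idx, t_word in enumerate(target_words):
        delay = len(source_words)  # fall back: wait for the full source
        for s_len in range(1, len(source_words) + 1):
            lp = word_logprob(source_words[:s_len],
                              target_words[:t_idx], t_word)
            if lp >= threshold:    # confident enough: emit now
                delay = s_len
                break
        delays.append(delay)
    return delays

# Example: each English word waits for its (toy-)aligned French word.
print(pick_delays(["le", "chat", "dort"], ["the", "cat", "sleeps"]))
```

Once such delays are chosen, target words can be shifted accordingly to build the time-aligned synthetic training data the abstract describes.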