RosettaSpeech: Zero-Shot Speech-to-Speech Translation from Monolingual Data

📅 2025-11-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Speech-to-speech translation (S2ST) has long been hindered by the scarcity of parallel speech data. Existing approaches typically rely on cascaded pipelines or bilingual speech pairs. This paper proposes a zero-shot S2ST framework that requires only monolingual speech-text data and machine-translated text pairs, using text as an intermediate modality during training while remaining an end-to-end, direct speech-to-speech model at inference, thereby eliminating the need for bilingual speech data. The method combines text-supervised training, monolingual speech encoding, and cross-modal alignment, and a single model supports many-to-one multilingual translation (FR/ES/DE → EN). Evaluated on CVSS-C, the model achieves ASR-BLEU scores of 25.17 (German→English) and 29.86 (Spanish→English), relative gains of over 27% and 14% over strong baselines. The paper also provides a foundational analysis of how training-data scale affects translation performance.

📝 Abstract
The scarcity of parallel speech corpora critically hampers speech-to-speech translation (S2ST), often forcing reliance on complex, multi-stage pipelines. This paper introduces RosettaSpeech, a novel and simplified framework for zero-shot S2ST that is trained on monolingual speech-text data augmented by machine translation supervision. While our method leverages the linguistic knowledge inherent in text-based NMT models, it strictly eliminates the need for parallel speech-to-speech pairs. Our model uniquely uses text as an intermediate bridge during training but functions as a direct, end-to-end speech-to-speech model at inference. This streamlined approach achieves state-of-the-art results on standard benchmarks. For instance, on the CVSS-C test set, RosettaSpeech outperforms leading systems, achieving an ASR-BLEU score of 25.17 for German-to-English and 29.86 for Spanish-to-English, relative gains of over 27% and 14%, respectively. Furthermore, we demonstrate that a single model can deliver strong many-to-one translation performance (FR/ES/DE → EN). We also provide a foundational analysis of how training data scaling impacts model performance. By prioritizing reliance on abundant parallel text rather than difficult-to-acquire parallel speech, RosettaSpeech offers a scalable path to creating high-quality, speaker-preserving S2ST for a much broader array of languages.
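As a quick sanity check on the reported numbers, the relative gains imply upper bounds on the baselines' ASR-BLEU scores (a minimal arithmetic sketch; the actual baseline systems are not named here):

```python
# Implied baseline ASR-BLEU upper bounds from the reported relative gains:
# 25.17 (De->En) is a >27% relative gain, 29.86 (Es->En) a >14% relative gain.
baseline_de_max = 25.17 / 1.27  # best prior De->En system scored below this
baseline_es_max = 29.86 / 1.14  # best prior Es->En system scored below this

print(round(baseline_de_max, 2))  # ~19.82 ASR-BLEU
print(round(baseline_es_max, 2))  # ~26.19 ASR-BLEU
```

So the German-to-English baselines sat below roughly 19.8 ASR-BLEU and the Spanish-to-English baselines below roughly 26.2.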
Problem

Research questions and friction points this paper is trying to address.

Addresses scarcity of parallel speech data for translation
Enables zero-shot speech translation using monolingual data
Eliminates need for parallel speech pairs during training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot speech translation without parallel speech data
Training uses monolingual speech-text with translation supervision
Text-bridged training enables direct speech-to-speech inference
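The text-bridged idea can be sketched as a data-construction step: monolingual (speech, transcript) pairs plus an off-the-shelf NMT model yield (source speech, target text) supervision without any parallel speech. This is a schematic sketch under assumptions; `build_training_pairs` and the toy translator are hypothetical names, not the paper's actual code:

```python
# Schematic of zero-shot S2ST supervision via a text bridge (hypothetical helper).
# No parallel speech is needed: targets come from machine-translating transcripts.

def build_training_pairs(monolingual_speech_text, translate):
    """Turn monolingual (speech, transcript) pairs into (speech, target_text)
    training pairs using an NMT function as the text bridge."""
    pairs = []
    for speech, transcript in monolingual_speech_text:
        target_text = translate(transcript)  # MT supervision, no bilingual audio
        pairs.append((speech, target_text))
    return pairs

# Toy usage with placeholder "speech" ids and a mock German->English translator.
data = [("de_utt_001", "guten tag"), ("de_utt_002", "danke")]
mock_mt = {"guten tag": "good day", "danke": "thank you"}

pairs = build_training_pairs(data, lambda t: mock_mt[t])
print(pairs)  # [('de_utt_001', 'good day'), ('de_utt_002', 'thank you')]
```

At inference no text is produced: the trained model maps source speech directly to target speech, with the text bridge used only to create training supervision.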