S2ST-Omni: An Efficient and Scalable Multilingual Speech-to-Speech Translation Framework via Seamlessly Speech-Text Alignment and Streaming Speech Decoder

📅 2025-06-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multilingual speech-to-speech translation (S2ST) faces two major challenges: the trade-off between high translation quality and low end-to-end latency, and a heavy reliance on scarce parallel speech data. To address these, the authors propose S2ST-Omni, an end-to-end framework that decomposes S2ST into speech-to-text translation (S2TT) and text-to-speech synthesis (TTS) while modeling both within a single speech-language model. It employs a lightweight speech adapter to align cross-modal representations, pairs Whisper's audio encoder with Qwen 3.0's strong textual understanding, and introduces a streaming autoregressive TTS decoder for real-time inference. The approach substantially reduces dependence on parallel speech corpora while achieving state-of-the-art BLEU and COMET scores on the CVSS benchmark, outperforming existing S2ST systems. Crucially, its end-to-end latency matches that of the best-performing baselines, demonstrating strong practical deployability for real-world multilingual translation applications.

📝 Abstract
Multilingual speech-to-speech translation (S2ST) aims to directly convert spoken utterances from multiple source languages into natural and intelligible speech in a target language. Despite recent progress, significant challenges remain: (1) achieving high-quality and low-latency S2ST remains a critical hurdle; (2) existing S2ST approaches heavily rely on large-scale parallel speech corpora, which are extremely difficult to collect. To address these issues, we propose S2ST-Omni, an efficient and scalable framework for multilingual speech-to-speech translation. Specifically, we decompose the S2ST task into speech-to-text translation (S2TT) and text-to-speech synthesis (TTS), unifying them within a single end-to-end speech-language model. To achieve high-quality S2TT while reducing dependence on parallel corpora, we leverage large-scale pretrained models -- Whisper for audio understanding and Qwen 3.0 for text understanding. A lightweight speech adapter is introduced to align speech and text representations, enabling effective use of pretrained multimodal knowledge. To ensure both translation quality and real-time performance, we adopt a pretrained streaming speech decoder in the TTS stage to generate target speech in an autoregressive manner. Extensive experiments on the CVSS benchmark demonstrate that S2ST-Omni outperforms state-of-the-art S2ST baselines while maintaining comparable latency, highlighting its effectiveness and practical potential for real-world deployment.
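The adapter described above can be pictured as a small projection that maps Whisper-style encoder frames into the LLM's embedding space. Below is a minimal NumPy sketch of that idea; the dimensions (1280 for the speech encoder, 2048 for the LLM), the frame-stacking stride, and the class/variable names are all illustrative assumptions, not details taken from the paper.

```python
import numpy as np

# Assumed dims (not from the paper): Whisper-like encoder ~1280,
# LLM hidden size ~2048.
D_SPEECH, D_TEXT = 1280, 2048

class SpeechAdapter:
    """Lightweight adapter sketch: temporally downsample speech frames by
    stacking, then linearly project them into the LLM embedding space so
    speech representations can be consumed alongside text tokens."""
    def __init__(self, d_in, d_out, stride=4, seed=0):
        rng = np.random.default_rng(seed)
        self.stride = stride  # downsampling factor via frame stacking
        self.w = rng.standard_normal((d_in * stride, d_out)) * 0.02

    def __call__(self, feats):
        # feats: (T, d_in) frame-level speech encoder output
        T = (feats.shape[0] // self.stride) * self.stride
        stacked = feats[:T].reshape(-1, feats.shape[1] * self.stride)
        return stacked @ self.w  # (T // stride, d_out), LLM-aligned

adapter = SpeechAdapter(D_SPEECH, D_TEXT)
speech_feats = np.zeros((100, D_SPEECH))  # 100 encoder frames
llm_inputs = adapter(speech_feats)
print(llm_inputs.shape)  # (25, 2048)
```

Only this small adapter would need training from scratch; the pretrained encoder and LLM supply the multimodal knowledge, which is what reduces the need for parallel speech corpora.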
Problem

Research questions and friction points this paper is trying to address.

Achieving high-quality low-latency multilingual speech translation
Reducing reliance on scarce parallel speech corpora
Unifying speech-text alignment with streaming speech synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Seamless speech-text alignment via lightweight adapter
Leveraging pretrained Whisper and Qwen models
Streaming speech decoder for real-time TTS
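The streaming-decoder idea above can be sketched as chunk-wise generation: speech units are emitted as text tokens arrive, rather than after the full sentence is decoded. The snippet below is a toy illustration of that latency principle only; the chunking scheme, `streaming_tts` function, and one-unit-per-token mapping are assumptions, not the paper's actual decoder.

```python
def streaming_tts(text_tokens, chunk_size=3):
    """Toy sketch of chunk-wise streaming synthesis (an assumption about
    the decoder's behavior, not the paper's exact method): speech units
    are generated autoregressively per text chunk, so playback can start
    before the full translation is available."""
    for i in range(0, len(text_tokens), chunk_size):
        chunk = text_tokens[i:i + chunk_size]
        # hypothetical unit generator: one discrete speech unit per token
        yield [f"unit({tok})" for tok in chunk]

# As text tokens stream in from the S2TT stage, each chunk of units can
# be vocoded immediately instead of waiting for the whole sentence.
chunks = list(streaming_tts(["hola", "mundo", ",", "qué", "tal"]))
print(len(chunks))  # 2 chunks of speech units
```

First-sound latency is then bounded by one chunk of decoding rather than the full utterance, which is how a streaming decoder keeps end-to-end latency comparable to non-streaming baselines.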