AI Summary
This paper systematically surveys the evolution of speech-to-speech translation (S2ST), focusing on the performance trade-offs between conventional cascaded approaches (ASR + MT + TTS) and end-to-end direct speech translation (DST) in real-time multilingual communication. Cascaded systems suffer from error propagation, high latency, and prosodic degradation. In contrast, DST significantly improves naturalness, preserves speaker identity, and reduces end-to-end latency by an average of 32%, yet faces challenges including data sparsity, high computational cost, and poor generalization to low-resource language pairs (BLEU < 18 across 12 such pairs). The study provides the first comprehensive analysis of DST's bottlenecks in implicit speech representation learning and cross-lingual phoneme modeling. It further proposes modeling optimizations tailored to low-resource scenarios. Collectively, this work establishes theoretical foundations and practical technical pathways for next-generation real-time multilingual S2ST systems.
Abstract
Speech-to-speech translation (S2ST) is a transformative technology that bridges global communication gaps, enabling real-time multilingual interactions in diplomacy, tourism, and international trade. Our review examines the evolution of S2ST, comparing traditional cascade models, which rely on automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) components, with newer end-to-end and direct speech translation (DST) models that bypass intermediate text representations. While cascade models offer modularity and individually optimized components, they suffer from error propagation, increased latency, and loss of prosody. In contrast, direct S2ST models retain speaker identity, reduce latency, and improve translation naturalness by preserving vocal characteristics and prosody. However, they remain limited by data sparsity, high computational costs, and poor generalization to low-resource languages. The current work critically evaluates these approaches, their trade-offs, and future directions for improving real-time multilingual communication.