🤖 AI Summary
Existing speech translation systems largely neglect paralinguistic information—such as emotion, intonation, and rhythm—resulting in flat, expressionless translations. To address this, we introduce the first multilingual movie speech alignment dataset explicitly designed for emotion and attitude expression, and propose an end-to-end speech translation framework that systematically models and preserves fine-grained non-linguistic features from the source speech. Methodologically, our approach employs discrete speech units, integrates multi-scale prosody encoding with cross-lingual prosody transfer, and incorporates contrastive learning alongside explicit duration alignment constraints. Experiments across multiple languages demonstrate significant improvements: +23.6% prosodic fidelity (MOS-P), +18.4% emotion recognition accuracy, while maintaining high translation quality (BLEU) and speech naturalness (MOS ≥ 4.1). Our work achieves a balanced triad of accuracy, naturalness, and expressiveness—setting a new benchmark for expressive speech translation.
📝 Abstract
Current research in speech-to-speech translation (S2ST) primarily concentrates on translation accuracy and speech naturalness, often overlooking key elements like paralinguistic information, which is essential for conveying emotions and attitudes in communication. To address this, our research introduces a novel, carefully curated multilingual dataset from various movie audio tracks. Each dataset pair is precisely matched for paralinguistic information and duration. We enhance this by integrating multiple prosody transfer techniques, aiming for translations that are accurate, natural-sounding, and rich in paralinguistic details. Our experimental results confirm that our model retains more paralinguistic information from the source speech while maintaining high standards of translation accuracy and naturalness.