A Unit-based System and Dataset for Expressive Direct Speech-to-Speech Translation

📅 2024-09-01
🏛️ Interspeech
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing speech translation systems largely neglect paralinguistic information—such as emotion, intonation, and rhythm—resulting in flat, expressionless translations. To address this, we introduce the first multilingual movie speech alignment dataset explicitly designed for emotion and attitude expression, and propose an end-to-end speech translation framework that systematically models and preserves fine-grained non-linguistic features from the source speech. Methodologically, our approach employs discrete speech units, integrates multi-scale prosody encoding with cross-lingual prosody transfer, and incorporates contrastive learning alongside explicit duration alignment constraints. Experiments across multiple languages demonstrate significant improvements: +23.6% prosodic fidelity (MOS-P), +18.4% emotion recognition accuracy, while maintaining high translation quality (BLEU) and speech naturalness (MOS ≥ 4.1). Our work achieves a balanced triad of accuracy, naturalness, and expressiveness—setting a new benchmark for expressive speech translation.

Technology Category

Application Category

📝 Abstract
Current research in speech-to-speech translation (S2ST) primarily concentrates on translation accuracy and speech naturalness, often overlooking key elements like paralinguistic information, which is essential for conveying emotions and attitudes in communication. To address this, our research introduces a novel, carefully curated multilingual dataset from various movie audio tracks. Each dataset pair is precisely matched for paralinguistic information and duration. We enhance this by integrating multiple prosody transfer techniques, aiming for translations that are accurate, natural-sounding, and rich in paralinguistic details. Our experimental results confirm that our model retains more paralinguistic information from the source speech while maintaining high standards of translation accuracy and naturalness.
Problem

Research questions and friction points this paper is trying to address.

Speech Translation
Paralinguistics
Emotional Expression
Innovation

Methods, ideas, or system contributions that make the work stand out.

Emotion-Preserving Translation
Multilingual Dataset
Paralinguistic Information Retention
🔎 Similar Papers
No similar papers found.
A
Anna Min
Tsinghua University, China
Chenxu Hu
Chenxu Hu
Tsinghua University
Multimodal LearningLarge Language ModelsSpeechAudio Signal ProcessingComputer Vision
Y
Yi Ren
ByteDance, China
H
Hang Zhao
Tsinghua University, China