When End-to-End is Overkill: Rethinking Cascaded Speech-to-Text Translation

📅 2025-02-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
In cascaded speech translation, severe error propagation between ASR and MT modules stems primarily from semantic divergence—where acoustically similar speech segments map to semantically dissimilar translations. To address this, we propose a novel multi-candidate self-supervised speech feature joint modeling paradigm. Our approach integrates ASR n-best hypotheses with self-supervised speech representations (e.g., wav2vec 2.0), leveraging conditional sequence modeling and cross-modal attention to enhance MT’s robustness to speech uncertainty. Evaluated on multilingual benchmarks including LibriSpeech→Europarl, our method significantly reduces error propagation, achieving lower WER and higher BLEU than strong end-to-end baselines. Crucially, it preserves model reusability—enabling plug-and-play integration with existing ASR and MT components—while maintaining high data efficiency. This work provides both a diagnostic insight into cascade failure modes and a practical, modular solution for robust speech translation.

📝 Abstract
Though end-to-end speech-to-text translation has been a great success, we argue that the cascaded speech-to-text translation model, usually criticized for error propagation between the automatic speech recognition (ASR) and machine translation (MT) models, still has its place. In this paper, we explore the benefits of incorporating multiple candidates from ASR and self-supervised speech features into MT. Our analysis reveals that the primary cause of cascading errors is the increased divergence between similar samples in the speech domain when they are mapped to the text domain. By including multiple candidates and self-supervised speech features, our approach allows the machine translation model to choose the right words and produce precise translations from varied speech samples. This strategy minimizes error spread and takes advantage of large ASR and MT datasets, along with pre-trained ASR/MT models, while addressing the associated issues.
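The core idea of passing multiple ASR candidates to MT, rather than committing to the 1-best transcript, can be illustrated with a toy sketch. This is not the paper's implementation; the scoring scheme and data are illustrative, showing only how n-best hypotheses might be softmax-weighted by their log-probabilities so a downstream MT model can attend over all of them:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def fuse_nbest(hypotheses):
    """hypotheses: list of (transcript, log_prob) pairs from ASR.
    Returns each transcript paired with an attention-style weight,
    so no single hypothesis is committed to prematurely."""
    weights = softmax([lp for _, lp in hypotheses])
    return [(t, w) for (t, _), w in zip(hypotheses, weights)]

# Toy 3-best list with made-up log-probabilities.
nbest = [("he sells sea shells", -1.2),
         ("he sells seashells", -1.5),
         ("heels else sea shells", -4.0)]
for text, weight in fuse_nbest(nbest):
    print(f"{weight:.3f}  {text}")
```

In the paper's full setting, these weighted candidates would be combined with self-supervised speech representations (e.g., wav2vec 2.0 features) via cross-modal attention inside the MT model; the sketch above covers only the multi-candidate weighting step.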
Problem

Research questions and friction points this paper is trying to address.

Speech-to-Text
Translation Accuracy
Error Propagation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Speech Recognition Variability
Integrated Self-supervised Features
Error Propagation Reduction
Anna Min
School of Software, Tsinghua University, Beijing, China
Chenxu Hu
Tsinghua University
Multimodal Learning, Large Language Models, Speech, Audio Signal Processing, Computer Vision
Yi Ren
TikTok
Hang Zhao
IIIS, Tsinghua University, Beijing, China