🤖 AI Summary
Cascaded speech-to-text translation (S2TT) systems suffer from two key limitations: error propagation and insufficient use of acoustic cues. Recent chain-of-thought (CoT) prompting approaches aim to model speech and text jointly, but our attribution analysis shows that they still rely heavily on the intermediate transcription and therefore behave much like a cascade. To address this, we examine two simple training interventions: (1) adding direct speech supervision, i.e., end-to-end S2TT data, to strengthen the model's use of speech, and (2) injecting noise into transcripts during training so that the model cannot depend on the transcript alone. Experiments show that these interventions increase speech attribution and improve robustness to corrupted transcripts. These results point to a simple, effective way to strengthen speech perception in CoT-based S2TT systems.
📝 Abstract
Speech-to-Text Translation (S2TT) systems built from Automatic Speech Recognition (ASR) and Text-to-Text Translation (T2TT) modules face two major limitations: error propagation and the inability to exploit prosodic or other acoustic cues. Chain-of-Thought (CoT) prompting has recently been introduced, with the expectation that jointly accessing speech and transcription will overcome these issues. Analyzing CoT through attribution methods, robustness evaluations with corrupted transcripts, and prosody-awareness tests, we find that it largely mirrors cascaded behavior, relying mainly on transcripts while barely leveraging speech. Simple training interventions, such as adding direct S2TT data or injecting noisy transcripts, enhance robustness and increase speech attribution. These findings challenge the assumed advantages of CoT and highlight the need for architectures that explicitly integrate acoustic information into translation.
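The abstract does not say how noisy transcripts are produced, so the following is only a minimal sketch of one plausible form of transcript corruption. The function name `corrupt_transcript`, the `noise_rate` parameter, the three word-level operations, and the filler vocabulary are all assumptions made for illustration, not the authors' procedure.

```python
import random


def corrupt_transcript(transcript, noise_rate=0.2, vocab=None, seed=None):
    """Return a copy of `transcript` with word-level noise (hypothetical scheme).

    Each word is independently corrupted with probability `noise_rate` by
    deleting it, substituting it with a filler word, or inserting a filler
    word after it. This loosely imitates ASR-style errors; it is an assumed
    scheme, not one described in the abstract.
    """
    rng = random.Random(seed)
    vocab = vocab or ["the", "a", "of", "and", "to"]  # assumed filler words
    corrupted = []
    for word in transcript.split():
        if rng.random() >= noise_rate:
            corrupted.append(word)  # keep the word unchanged
            continue
        operation = rng.choice(("delete", "substitute", "insert"))
        if operation == "substitute":
            corrupted.append(rng.choice(vocab))
        elif operation == "insert":
            corrupted.extend([word, rng.choice(vocab)])
        # "delete": the word is simply dropped
    return " ".join(corrupted)


if __name__ == "__main__":
    clean = "she never said he stole my money"
    print(corrupt_transcript(clean, noise_rate=0.3, seed=0))
```

In a setup like this, the corrupted string would replace the clean transcript in the CoT context while the reference translation stays unchanged, so the correct output can only be recovered by attending to the speech input.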