Listening or Reading? Evaluating Speech Awareness in Chain-of-Thought Speech-to-Text Translation

📅 2025-10-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Cascaded speech-to-text translation (S2TT) systems suffer from two key limitations: error propagation and an inability to exploit acoustic cues. Chain-of-thought (CoT) prompting is expected to overcome both by giving the model joint access to speech and transcription. However, the paper's attribution analysis, corrupted-transcript robustness evaluations, and prosody-awareness tests show that CoT models still rely heavily on the intermediate transcript and behave much like a cascade. The paper examines two simple training interventions: (1) adding direct (end-to-end) S2TT data, and (2) injecting noisy transcripts during training. Both enhance robustness and increase speech attribution. The findings challenge the assumed advantages of CoT and point to the need for architectures that explicitly integrate acoustic information into translation.

📝 Abstract

Speech-to-Text Translation (S2TT) systems built from Automatic Speech Recognition (ASR) and Text-to-Text Translation (T2TT) modules face two major limitations: error propagation and the inability to exploit prosodic or other acoustic cues. Chain-of-Thought (CoT) prompting has recently been introduced, with the expectation that jointly accessing speech and transcription will overcome these issues. Analyzing CoT through attribution methods, robustness evaluations with corrupted transcripts, and prosody-awareness, we find that it largely mirrors cascaded behavior, relying mainly on transcripts while barely leveraging speech. Simple training interventions, such as adding Direct S2TT data or noisy transcript injection, enhance robustness and increase speech attribution. These findings challenge the assumed advantages of CoT and highlight the need for architectures that explicitly integrate acoustic information into translation.
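
As a rough illustration of what "noisy transcript injection" could look like, the sketch below corrupts a reference transcript with random word deletions, substitutions, and insertions. This is a minimal sketch under stated assumptions, not the authors' procedure: the function name `corrupt_transcript`, the `noise_rate` parameter, the equal split between the three corruption types, and the choice of replacement vocabulary are all hypothetical and do not come from the paper.

```python
import random


def corrupt_transcript(transcript, noise_rate=0.3, vocab=None, seed=None):
    """Return a noisy copy of `transcript` (hypothetical helper, not from the paper).

    Each word is independently deleted, substituted, or followed by an
    inserted word, with a total corruption probability of `noise_rate`
    split evenly across the three operations (an assumption).
    """
    rng = random.Random(seed)
    words = transcript.split()
    # Assumption: replacement words are drawn from `vocab`, falling back
    # to the words of the transcript itself when no vocabulary is given.
    pool = vocab or words
    noisy = []
    for word in words:
        r = rng.random()
        if r < noise_rate / 3:
            continue  # deletion: drop the word
        elif r < 2 * noise_rate / 3:
            noisy.append(rng.choice(pool))  # substitution
        elif r < noise_rate:
            noisy.append(word)
            noisy.append(rng.choice(pool))  # insertion after the word
        else:
            noisy.append(word)  # keep the word unchanged
    return " ".join(noisy)


if __name__ == "__main__":
    clean = "the weather is nice today"
    print(corrupt_transcript(clean, noise_rate=0.5, seed=0))
```

Training on examples whose transcript step is corrupted in this way, while the target translation stays clean, is one plausible way to discourage a model from copying the transcript blindly. How the paper mixes clean and noisy examples is not stated in this summary.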

Problem

Research questions and friction points this paper is trying to address.

Evaluating speech awareness in chain-of-thought speech translation

Assessing reliance on transcripts versus acoustic cues

Addressing error propagation in cascaded translation systems

Innovation

Methods, ideas, or system contributions that make the work stand out.

Attribution, corrupted-transcript, and prosody-awareness analyses show that CoT S2TT relies mainly on transcripts

Training interventions enhance robustness and speech attribution

Architectures need to integrate acoustic information explicitly

🔎 Similar Papers

Chain-of-Thought Prompting for Speech Translation