Revisiting Direct Speech-to-Text Translation with Speech LLMs: Better Scaling than CoT Prompting?

📅 2025-10-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the performance evolution of direct prompting versus chain-of-thought (CoT) prompting in speech-to-text translation (S2TT) across varying data scales. Leveraging large language models, we construct a multilingual pseudo-labeled S2TT dataset and systematically vary training data volume to evaluate the generalization and scalability of both prompting paradigms. Experimental results show that direct prompting exhibits consistent performance gains with increasing S2TT data, significantly outperforming CoT prompting in large-scale settings—challenging the prevailing assumption that CoT is inherently superior for complex multimodal tasks. Our key contribution is the first empirical demonstration that, under high-quality, multilingual S2TT supervision, direct prompting achieves superior scalability and practical utility. This finding provides new evidence supporting low-overhead, end-to-end S2TT paradigms without explicit reasoning scaffolding.

📝 Abstract
Recent work on Speech-to-Text Translation (S2TT) has focused on LLM-based models, introducing the increasingly adopted Chain-of-Thought (CoT) prompting, where the model is guided to first transcribe the speech and then translate it. CoT typically outperforms direct prompting primarily because it can exploit abundant Automatic Speech Recognition (ASR) and Text-to-Text Translation (T2TT) datasets to explicitly model its steps. In this paper, we systematically compare CoT and Direct prompting under increasing amounts of S2TT data. To this end, we pseudo-label an ASR corpus by translating its transcriptions into six European languages, and train LLM-based S2TT systems with both prompting strategies at different data scales. Our results show that Direct improves more consistently as the amount of data increases, suggesting that it may become a more effective approach as larger S2TT resources are created.
Problem

Research questions and friction points this paper is trying to address.

Comparing direct versus chain-of-thought prompting for speech translation
Investigating scaling effects of data volume on translation methods
Evaluating performance of speech LLMs with pseudo-labeled multilingual data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Direct prompting improves more consistently than CoT as S2TT data grows
Pseudo-labeled multilingual S2TT corpus built by translating ASR transcriptions into six European languages
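To make the contrast concrete, the two paradigms differ only in what the prompt asks the model to produce: Direct requests the translation in one step, while CoT asks for a transcription first and then the translation. The sketch below illustrates this with hypothetical prompt templates; the `<speech>` placeholder and the exact wording are illustrative assumptions, not the authors' actual formats.

```python
def direct_prompt(target_lang: str) -> str:
    # Direct prompting: ask for the translation in a single step,
    # with no explicit intermediate transcription.
    return f"<speech> Translate the speech into {target_lang}."

def cot_prompt(target_lang: str) -> str:
    # CoT prompting: expose the intermediate ASR step, so the model
    # can leverage abundant ASR and T2TT supervision for each stage.
    return (
        "<speech> First transcribe the speech, "
        f"then translate the transcription into {target_lang}."
    )

print(direct_prompt("German"))
print(cot_prompt("German"))
```

The paper's finding is that, given enough S2TT data, the single-step Direct formulation scales better than the two-step CoT one, despite CoT's advantage in low-resource settings.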