🤖 AI Summary
This study investigates how direct prompting and chain-of-thought (CoT) prompting for speech-to-text translation (S2TT) compare as training data scales. The authors construct a multilingual pseudo-labeled S2TT dataset by translating the transcriptions of an ASR corpus into six European languages, then train LLM-based S2TT systems with both prompting strategies at several data scales to evaluate their generalization and scalability. Experimental results show that direct prompting improves more consistently as the amount of S2TT data grows, challenging the prevailing assumption that CoT's explicit transcribe-then-translate scaffolding is inherently superior for this task. The key contribution is empirical evidence that, under high-quality multilingual S2TT supervision, direct prompting scales well and offers practical advantages, supporting low-overhead, end-to-end S2TT paradigms without explicit reasoning scaffolding.
📝 Abstract
Recent work on Speech-to-Text Translation (S2TT) has focused on LLM-based models, which increasingly adopt Chain-of-Thought (CoT) prompting, where the model is guided to first transcribe the speech and then translate the transcription. CoT typically outperforms direct prompting, primarily because it can exploit abundant Automatic Speech Recognition (ASR) and Text-to-Text Translation (T2TT) datasets to explicitly supervise its intermediate steps. In this paper, we systematically compare CoT and Direct prompting under increasing amounts of S2TT data. To this end, we pseudo-label an ASR corpus by translating its transcriptions into six European languages, and train LLM-based S2TT systems with both prompting strategies at different data scales. Our results show that Direct improves more consistently as the amount of data increases, suggesting that it may become the more effective approach as larger S2TT resources are created.
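To make the contrast concrete, the two prompting strategies can be sketched as prompt templates. This is a minimal illustrative sketch, not the paper's actual prompts: the `<audio>` placeholder, function names, and instruction wording are all assumptions.

```python
# Hypothetical prompt templates contrasting the two S2TT prompting strategies.
# "<audio>" stands in for however the model ingests speech features.

def direct_prompt(audio_placeholder: str, tgt_lang: str) -> str:
    """Direct: ask for the translation in a single step."""
    return (
        f"{audio_placeholder}\n"
        f"Translate the speech into {tgt_lang}."
    )

def cot_prompt(audio_placeholder: str, tgt_lang: str) -> str:
    """CoT: ask for a transcription first, then its translation."""
    return (
        f"{audio_placeholder}\n"
        f"First transcribe the speech, then translate the "
        f"transcription into {tgt_lang}."
    )

print(direct_prompt("<audio>", "German"))
print(cot_prompt("<audio>", "German"))
```

Because the CoT template elicits the transcription as an explicit intermediate output, its two steps can be supervised separately with ASR and T2TT data, whereas the Direct template can only be trained on paired S2TT data, which is exactly why the amount of available S2TT data matters for the comparison.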