🤖 AI Summary
Existing speech-to-text translation (S2TT) research is predominantly English-centric and unidirectional, and is severely constrained by the scarcity of parallel speech-text corpora, which limits scalability to many-to-many translation scenarios. To address this, we propose the first multimodal Chain-of-Thought (CoT) framework for S2TT, decomposing end-to-end speech translation into an interpretable two-stage collaborative reasoning process: “speech → semantics → text”. Methodologically, we introduce multimodal CoT prompting and a three-stage training strategy—ASR-guided pre-alignment, CoT instruction fine-tuning, and end-to-end reinforcement learning—while employing a small language model (SLM) for joint speech-text modeling. Our approach achieves new state-of-the-art results on the CoVoST-2 and MuST-C benchmarks: +0.3 BLEU to 30.8 for en→ja and +2.5 BLEU to 47.7 for en→zh on CoVoST-2, and +1.6 BLEU to 21.2 for en→zh on MuST-C.
📝 Abstract
Speech Language Models (SLMs) have demonstrated impressive performance on speech translation tasks. However, existing research primarily focuses on direct instruction fine-tuning and often overlooks the inherent reasoning capabilities of SLMs. In this paper, we introduce a three-stage training framework designed to activate the chain-of-thought (CoT) capabilities of SLMs. We propose CoT-ST, a speech translation model that uses multimodal CoT to decompose speech translation into sequential steps of speech recognition and translation. We validated the effectiveness of our method on two datasets: CoVoST-2 and MuST-C. The experimental results demonstrate that CoT-ST outperforms previous state-of-the-art methods, achieving higher BLEU scores (CoVoST-2 en→ja: 30.5 → 30.8, en→zh: 45.2 → 47.7; MuST-C en→zh: 19.6 → 21.2). This work is open sourced at https://github.com/X-LANCE/SLAM-LLM/tree/main/examples/st_covost2.
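The "speech → semantics → text" decomposition can be sketched as a two-stage generation loop: the SLM first emits a transcript of the speech (ASR), then conditions on that transcript to produce the translation. The sketch below is purely illustrative — the prompt wording, `build_cot_prompt`, `cot_speech_translate`, and the `slm_generate` callable are all hypothetical stand-ins, not the actual SLAM-LLM API.

```python
# Illustrative sketch of the two-stage CoT decomposition described in the
# abstract. All names and prompt templates here are hypothetical; the real
# implementation lives in the SLAM-LLM repository linked above.

def build_cot_prompt(source_lang: str, target_lang: str) -> str:
    """Multimodal CoT instruction: transcribe first, then translate."""
    return (
        "<speech>\n"
        f"Step 1: Transcribe the {source_lang} speech.\n"
        f"Step 2: Translate the transcript into {target_lang}.\n"
    )

def cot_speech_translate(speech_features, slm_generate,
                         source_lang="English", target_lang="Chinese"):
    """Decompose S2TT into recognition followed by translation.

    `slm_generate(features, prompt)` stands in for the SLM's conditional
    generation call; `speech_features` stands in for encoded audio.
    """
    prompt = build_cot_prompt(source_lang, target_lang)
    # Stage 1 (speech -> semantics): the model emits a transcript.
    transcript = slm_generate(speech_features, prompt + "Transcript:")
    # Stage 2 (semantics -> text): the transcript conditions the translation.
    translation = slm_generate(
        speech_features, prompt + f"Transcript: {transcript}\nTranslation:"
    )
    return transcript, translation
```

In a real system both stages would be a single autoregressive pass, with the transcript generated as intermediate reasoning tokens before the translation; the explicit two-call structure here just makes the chain visible.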