CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought

📅 2024-09-29
🏛️ arXiv.org
📈 Citations: 9
Influential: 0
📄 PDF
🤖 AI Summary
Existing speech-to-text translation (S2TT) research is predominantly English-centric and unidirectional, and it is constrained by the scarcity of parallel speech-text corpora, which limits scalability to many-to-many translation. To address this, the authors propose CoT-ST, a multimodal Chain-of-Thought (CoT) framework for S2TT that decomposes end-to-end speech translation into an interpretable two-stage reasoning process: speech → semantics → text. Methodologically, they introduce multimodal CoT prompting and a three-stage training strategy (ASR-guided pre-alignment, CoT instruction fine-tuning, and end-to-end reinforcement learning), using a speech language model (SLM) for joint speech-text modeling. The approach achieves new state-of-the-art results on the CoVoST-2 and MuST-C benchmarks: +0.3 BLEU to 30.8 for en→ja and +2.5 to 47.7 for en→zh on CoVoST-2, and +1.6 to 21.2 for en→zh on MuST-C.

📝 Abstract
Speech Language Models (SLMs) have demonstrated impressive performance on speech translation tasks. However, existing research primarily focuses on direct instruction fine-tuning and often overlooks the inherent reasoning capabilities of SLMs. In this paper, we introduce a three-stage training framework designed to activate the chain-of-thought (CoT) capabilities of SLMs. We propose CoT-ST, a speech translation model that utilizes multimodal CoT to decompose speech translation into sequential steps of speech recognition and translation. We validated the effectiveness of our method on two datasets: the CoVoST-2 and MuST-C datasets. The experimental results demonstrate that CoT-ST outperforms previous state-of-the-art methods, achieving higher BLEU scores (CoVoST-2 en-ja: 30.5 → 30.8, en-zh: 45.2 → 47.7; MuST-C en-zh: 19.6 → 21.2). This work is open sourced at https://github.com/X-LANCE/SLAM-LLM/tree/main/examples/st_covost2.
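The two-stage decomposition described in the abstract (transcribe the speech first, then translate the transcript) can be sketched as a prompt/parse pair. This is a minimal illustrative sketch, not the authors' actual prompting code: the prompt wording, the "Transcript:"/"Translation:" response format, and both function names are assumptions for illustration.

```python
def build_cot_prompt(target_lang: str) -> str:
    """Compose a chain-of-thought instruction that splits speech
    translation into a recognition step and a translation step."""
    return (
        "Step 1: Transcribe the speech into its source-language text.\n"
        f"Step 2: Translate the transcript into {target_lang}.\n"
        "Respond as:\nTranscript: <transcript>\nTranslation: <translation>"
    )

def parse_cot_output(output: str) -> dict:
    """Split a model response of the assumed form
    'Transcript: ...' / 'Translation: ...' into its two stages."""
    result = {}
    for line in output.splitlines():
        if line.startswith("Transcript:"):
            result["transcript"] = line[len("Transcript:"):].strip()
        elif line.startswith("Translation:"):
            result["translation"] = line[len("Translation:"):].strip()
    return result

# Example with a mock model response (no real SLM is called here):
mock_response = "Transcript: good morning\nTranslation: おはよう"
parsed = parse_cot_output(mock_response)
```

In an actual pipeline, the prompt would be paired with the speech features and fed to the SLM; the intermediate transcript makes the translation step inspectable, which is the point of the CoT decomposition.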
Problem

Research questions and friction points this paper is trying to address.

Improving many-to-many speech-to-text translation with limited data
Adapting LLMs to low-resource S2TT tasks via curriculum learning
Achieving state-of-the-art performance with minimal speech data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Three-stage curriculum learning strategy
Leverages LLMs for low-resource S2TT
State-of-the-art many-to-many translation across 15×14 language directions
Yexing Du — Harbin Institute of Technology
Ziyang Ma — Shanghai Jiao Tong University
Yifan Yang — Shanghai Jiao Tong University
Keqi Deng — University of Cambridge
Xie Chen — Peng Cheng Laboratory
Bo Yang — Peng Cheng Laboratory
Yang Xiang — Peng Cheng Laboratory
Ming Liu — Harbin Institute of Technology, Peng Cheng Laboratory
Bing Qin — Harbin Institute of Technology