🤖 AI Summary
Existing speech-to-text translation (S2TT) research is predominantly English-centric and unidirectional, and is severely constrained by the scarcity of parallel speech-text corpora, which limits scalability to many-to-many translation scenarios. To address this, we propose the first multimodal Chain-of-Thought (CoT) framework for S2TT, decomposing end-to-end speech translation into an interpretable two-stage collaborative reasoning process: “speech → semantics → text”. Methodologically, we introduce multimodal CoT prompting and a three-stage training strategy—ASR-guided pre-alignment, CoT instruction fine-tuning, and end-to-end reinforcement learning—while employing a small language model (SLM) for joint speech-text modeling. Our approach achieves new state-of-the-art results on the CoVoST-2 and MuST-C benchmarks: +0.3 BLEU to 30.8 for en→ja and +2.5 BLEU to 47.7 for en→zh on CoVoST-2, and +1.6 BLEU to 21.2 for en→zh on MuST-C.
📝 Abstract
Speech Language Models (SLMs) have demonstrated impressive performance on speech translation tasks. However, existing research primarily focuses on direct instruction fine-tuning and often overlooks the inherent reasoning capabilities of SLMs. In this paper, we introduce a three-stage training framework designed to activate the chain-of-thought (CoT) capabilities of SLMs. We propose CoT-ST, a speech translation model that uses multimodal CoT to decompose speech translation into sequential steps of speech recognition and translation. We validated the effectiveness of our method on two datasets: CoVoST-2 and MuST-C. The experimental results demonstrate that CoT-ST outperforms previous state-of-the-art methods, achieving higher BLEU scores (CoVoST-2 en→ja: 30.5 → 30.8, en→zh: 45.2 → 47.7; MuST-C en→zh: 19.6 → 21.2). This work is open sourced at https://github.com/X-LANCE/SLAM-LLM/tree/main/examples/st_covost2.
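The "speech → semantics → text" decomposition can be sketched as a two-stage generation loop: the SLM first emits a transcript of the speech (ASR), then conditions on that transcript to produce the translation. The sketch below is purely illustrative — the prompt wording, `build_cot_prompt`, `cot_speech_translate`, and the `slm_generate` callable are all hypothetical stand-ins, not the actual SLAM-LLM API.

```python
# Illustrative sketch of the two-stage CoT decomposition described in the
# abstract. All names and prompt templates here are hypothetical; the real
# implementation lives in the SLAM-LLM repository linked above.

def build_cot_prompt(source_lang: str, target_lang: str) -> str:
    """Multimodal CoT instruction: transcribe first, then translate."""
    return (
        "<speech>\n"
        f"Step 1: Transcribe the {source_lang} speech.\n"
        f"Step 2: Translate the transcript into {target_lang}.\n"
    )

def cot_speech_translate(speech_features, slm_generate,
                         source_lang="English", target_lang="Chinese"):
    """Decompose S2TT into recognition followed by translation.

    `slm_generate(features, prompt)` stands in for the SLM's conditional
    generation call; `speech_features` stands in for encoded audio.
    """
    prompt = build_cot_prompt(source_lang, target_lang)
    # Stage 1 (speech -> semantics): the model emits a transcript.
    transcript = slm_generate(speech_features, prompt + "Transcript:")
    # Stage 2 (semantics -> text): the transcript conditions the translation.
    translation = slm_generate(
        speech_features, prompt + f"Transcript: {transcript}\nTranslation:"
    )
    return transcript, translation
```

In a real system both stages would be a single autoregressive pass, with the transcript generated as intermediate reasoning tokens before the translation; the explicit two-call structure here just makes the chain visible.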