Chain-of-Thought Prompting for Speech Translation

πŸ“… 2024-09-17
πŸ›οΈ IEEE International Conference on Acoustics, Speech, and Signal Processing
πŸ“ˆ Citations: 1
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address semantic distortion and translation inconsistency arising from end-to-end modeling in automatic speech translation (AST), this paper proposes a two-step chain-of-thought (CoT) prompting method: it explicitly uses ASR transcripts as intermediate representations and encodes them jointly with the speech features, establishing a stepwise reasoning path of speech → transcript → translation. The method is instantiated in a Speech-LLM that pairs a speech encoder with an encoder-decoder Megatron-T5 backbone and uses LoRA for parameter-efficient adaptation, which outperforms full fine-tuning. Evaluated on six En→X and X→En AST tasks, CoT prompting gains an average of +2.4 BLEU over speech-only prompting and outperforms a related CoT prediction method (which predicts a concatenated ASR+AST sequence) by an average of +2.0 BLEU, improving both translation accuracy and consistency.
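The two-step pipeline above can be sketched as follows. This is a minimal control-flow illustration, not the paper's implementation: `encode_speech` and `llm_generate` are hypothetical stand-ins for the speech encoder and the Megatron-T5 decoder, with canned outputs so the decode-then-prompt flow is visible.

```python
def encode_speech(audio):
    # Stand-in: a real system returns a sequence of speech embeddings.
    return f"<speech:{audio}>"

def llm_generate(prompt):
    # Stand-in LLM: returns canned strings so the two-step flow is testable.
    if "Transcribe" in prompt:
        return "hello world"   # step 1 output: ASR transcript
    return "hallo welt"        # step 2 output: translation

def cot_translate(audio, target_lang="de"):
    speech = encode_speech(audio)
    # Step 1: decode the speech into an ASR transcript.
    transcript = llm_generate(f"Transcribe: {speech}")
    # Step 2: prompt with BOTH the encoded speech and the transcript,
    # guiding translation through the intermediate text (CoT-style).
    return llm_generate(f"Translate to {target_lang}: {speech} | {transcript}")

print(cot_translate("greeting.wav"))  # hallo welt
```

The key design point is that step 2 conditions on the transcript in addition to, not instead of, the speech features, which distinguishes this from a plain ASR→MT cascade.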

πŸ“ Abstract
Large language models (LLMs) have demonstrated remarkable advancements in language understanding and generation. Building on the success of text-based LLMs, recent research has adapted these models to use speech embeddings for prompting, resulting in Speech-LLM models that exhibit strong performance in automatic speech recognition (ASR) and automatic speech translation (AST). In this work, we propose a novel approach that leverages ASR transcripts as prompts for AST in a Speech-LLM built on an encoder-decoder text LLM. The Speech-LLM model consists of a speech encoder and an encoder-decoder Megatron-T5. By first decoding speech to generate ASR transcripts and subsequently using these transcripts along with the encoded speech for prompting, we guide the speech translation in a two-step process akin to chain-of-thought (CoT) prompting. Low-rank adaptation (LoRA) is used to adapt the T5 LLM and shows superior performance to full model fine-tuning. Experimental results show that the proposed CoT prompting significantly improves AST performance, achieving an average increase of 2.4 BLEU points across 6 En->X or X->En AST tasks compared to speech prompting alone. Additionally, compared to a related CoT prediction method that predicts a concatenated sequence of ASR and AST transcripts, our method performs better by an average of 2 BLEU points.
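The LoRA adaptation mentioned in the abstract replaces full fine-tuning of a weight matrix W with a trainable low-rank update, so the effective weight becomes W + (alpha/r)·BA. A minimal NumPy sketch of that idea, with illustrative shapes and scaling (the paper's actual ranks and target modules are not given here):

```python
import numpy as np

d, r, alpha = 8, 2, 16            # hidden size, rank, scaling (illustrative)
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))   # frozen pretrained weight
A = rng.standard_normal((r, d))   # trainable down-projection
B = np.zeros((d, r))              # trainable up-projection (zero init => no change)

def lora_forward(x):
    # Frozen path plus the scaled low-rank update.
    return x @ (W + (alpha / r) * B @ A).T

x = rng.standard_normal(d)
# With B initialized to zero, the adapted model matches the frozen model,
# so training starts from the pretrained behavior.
assert np.allclose(lora_forward(x), x @ W.T)
```

Only A and B are trained (2·r·d parameters here versus d·d for full fine-tuning), which is why LoRA makes adapting a large T5 backbone cheap.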
Problem

Research questions and friction points this paper is trying to address.

Improving speech translation by using ASR transcripts as prompts
Enhancing AST performance via chain-of-thought prompting
Adapting the T5 LLM with LoRA instead of full fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Speech-LLM with ASR transcripts for AST
Chain-of-Thought prompting in two steps
Low-rank adaptation for model efficiency
πŸ”Ž Similar Papers
No similar papers found.