Zero-resource Speech Translation and Recognition with LLMs

📅 2024-12-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses zero-resource speech translation (ST) and automatic speech recognition (ASR), where no paired speech–text data is available. We propose a cross-lingual speech understanding framework that decouples a pretrained multilingual speech encoder (e.g., XLS-R) from a purely text-based multilingual large language model (LLM) (e.g., BLOOMZ or Phi-3). A learnable linear or MLP adapter maps speech embeddings into the LLM's token embedding space, and contrastive alignment combined with instruction tuning enables zero-shot transfer. Our key contributions are: (i) the first direct use of purely text-based multilingual LLMs for zero-shot ST/ASR; (ii) elimination of reliance on speech–text parallel data; and (iii) empirical demonstration that the LLM's ability to generate text in the target language constitutes the performance upper bound. On CoVoST2, our method achieves BLEU scores above 23 for ST on two previously unseen languages and WERs of up to 28.2% for ASR, without any language-specific fine-tuning or paired supervision.
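The adapter described above, a learnable linear or MLP map from speech-encoder frames into the LLM's token-embedding space, can be sketched roughly as follows. The dimensions, random initialisation, and the `adapt` helper are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not taken from the paper): XLS-R-style encoder
# frames of size 1024 mapped into a 2048-dim LLM token-embedding space.
D_SPEECH, D_HIDDEN, D_LLM = 1024, 2048, 2048

# Randomly initialised MLP adapter weights; in the actual system these
# would be learned while encoder and LLM stay frozen.
W1 = rng.standard_normal((D_SPEECH, D_HIDDEN)) * 0.02
b1 = np.zeros(D_HIDDEN)
W2 = rng.standard_normal((D_HIDDEN, D_LLM)) * 0.02
b2 = np.zeros(D_LLM)

def adapt(speech_frames: np.ndarray) -> np.ndarray:
    """Map (n_frames, D_SPEECH) encoder outputs to (n_frames, D_LLM)
    pseudo token embeddings the LLM can consume as a prefix."""
    hidden = np.maximum(speech_frames @ W1 + b1, 0.0)  # ReLU hidden layer
    return hidden @ W2 + b2

frames = rng.standard_normal((50, D_SPEECH))  # dummy encoder output
pseudo_tokens = adapt(frames)
print(pseudo_tokens.shape)  # (50, 2048)
```

The adapted frames would then be prepended to the embedded instruction tokens before being fed to the frozen LLM.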

📝 Abstract
Despite recent advancements in speech processing, zero-resource speech translation (ST) and automatic speech recognition (ASR) remain challenging problems. In this work, we propose to leverage a multilingual Large Language Model (LLM) to perform ST and ASR in languages for which the model has never seen paired audio-text data. We achieve this by using a pre-trained multilingual speech encoder, a multilingual LLM, and a lightweight adaptation module that maps the audio representations to the token embedding space of the LLM. We perform several experiments both in ST and ASR to understand how to best train the model and what data has the most impact on performance in previously unseen languages. In ST, our best model is capable of achieving BLEU scores over 23 in CoVoST2 for two previously unseen languages, while in ASR, we achieve WERs of up to 28.2%. Finally, we show that the performance of our system is bounded by the ability of the LLM to output text in the desired language.
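The alignment between speech and text representations mentioned in the summary can be sketched with a symmetric InfoNCE-style contrastive loss over paired (speech, text) embeddings. This is a generic formulation for illustration; the paper's exact objective, temperature, and batch construction may differ:

```python
import numpy as np

def info_nce(speech_emb: np.ndarray, text_emb: np.ndarray,
             temperature: float = 0.07) -> float:
    """Symmetric InfoNCE loss for a batch of paired embeddings.
    speech_emb, text_emb: (batch, dim), assumed L2-normalised; row i of
    each matrix is a positive pair, all other rows are negatives."""
    logits = speech_emb @ text_emb.T / temperature  # (batch, batch) similarities
    labels = np.arange(len(logits))                 # positives on the diagonal

    def xent(l: np.ndarray) -> float:
        # Row-wise cross-entropy against the diagonal targets.
        l = l - l.max(axis=1, keepdims=True)        # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average speech->text and text->speech directions.
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(1)
emb = rng.standard_normal((8, 64))
speech = emb / np.linalg.norm(emb, axis=1, keepdims=True)
text = speech.copy()                 # perfectly aligned pairs
aligned_loss = info_nce(speech, text)
shuffled_loss = info_nce(speech, text[::-1].copy())  # mismatched pairs
```

Training the adapter with such a loss pulls each utterance embedding toward its transcript/translation embedding while pushing it away from the other texts in the batch.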
Problem

Research questions and friction points this paper is trying to address.

Speech Translation
Automatic Transcription
Unseen Languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual Pre-trained Model
Unseen Language Processing
Speech Translation and Transcription