Cross-Lingual Transfer Learning for Speech Translation

📅 2024-07-01
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the scarcity of target-language data in multilingual speech translation by proposing a zero-shot cross-lingual transfer method built on the Whisper foundation model. Conventional approaches require abundant target-language ASR or translation annotations, hindering deployment for low-resource languages. Method: speech-to-speech retrieval over Whisper encoder representations is used to show that utterances from different languages are mapped to a shared semantic space; only the decoder is then fine-tuned, exclusively on English–Chinese speech translation data, with no target-language speech recognition or translation supervision. Results: the method yields significant gains on English→Chinese translation and enables zero-shot transfer from unseen source languages—including Japanese, French, and Spanish—without any target-language data. This work provides empirical evidence that speech encoder latent spaces inherently support cross-lingual transfer for speech translation, establishing a practical paradigm for low-resource settings.
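The speech-to-speech retrieval check described above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the vectors here are synthetic stand-ins for (e.g. mean-pooled) Whisper encoder states, and `retrieve` simply ranks a pool of utterances by cosine similarity to each query.

```python
import numpy as np

def retrieve(query_embs, pool_embs):
    """Return, for each query utterance, the index of the most
    cosine-similar utterance in the pool (toy retrieval sketch)."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = q @ p.T          # (n_query, n_pool) cosine similarities
    return sims.argmax(axis=1)

# Toy demo: three "English" utterances and their "Chinese" counterparts,
# modelled as slightly perturbed copies to mimic a shared semantic space.
rng = np.random.default_rng(0)
en = rng.normal(size=(3, 8))
zh = en + 0.05 * rng.normal(size=(3, 8))
print(retrieve(en, zh))     # → [0 1 2]: each query finds its translation
```

If the encoder truly maps translations close together, matched pairs dominate the similarity matrix and retrieval accuracy is high — the paper uses this as evidence for the shared embedding space.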

📝 Abstract
There has been increasing interest in building multilingual foundation models for NLP and speech research. This paper examines how to expand the speech translation capability of these models with restricted data. Whisper, a speech foundation model with strong performance on speech recognition and English translation, is used as the example model. Using speech-to-speech retrieval to analyse the audio representations generated by the encoder, we show that utterances from different languages are mapped to a shared semantic space. This shared embedding space can then be leveraged for zero-shot cross-lingual transfer in speech translation. By fine-tuning the Whisper decoder with only English-to-Chinese speech translation data, improved performance for translation to Chinese can be obtained for multiple languages, in addition to English. Furthermore, for languages related to those seen in training, it is possible to perform speech translation despite the model never having seen the language in training or being able to transcribe it.
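The decoder-only fine-tuning described in the abstract keeps the encoder's shared semantic space intact. The sketch below shows the training pattern with hypothetical stand-in modules (two `nn.Linear` layers in place of Whisper's real encoder and decoder, which would normally be loaded from e.g. `openai/whisper`): the encoder is frozen, and only decoder parameters receive gradients.

```python
import torch
from torch import nn

# Hypothetical stand-ins for Whisper's encoder and decoder.
encoder = nn.Linear(80, 16)
decoder = nn.Linear(16, 100)

# Freeze the encoder so its cross-lingual embedding space is preserved.
for p in encoder.parameters():
    p.requires_grad = False

# Optimise only the (trainable) decoder parameters.
optimizer = torch.optim.AdamW(
    [p for p in decoder.parameters() if p.requires_grad], lr=1e-4)

mel = torch.randn(4, 80)                # toy batch of audio features
target = torch.randint(0, 100, (4,))    # toy target-token ids

loss = nn.functional.cross_entropy(decoder(encoder(mel)), target)
loss.backward()
optimizer.step()

# The frozen encoder accumulates no gradient; the decoder does.
print(encoder.weight.grad is None, decoder.weight.grad is not None)  # True True
```

Because only the decoder sees English-to-Chinese supervision, any other source language the frozen encoder already maps into the shared space can be translated zero-shot at inference time.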
Problem

Research questions and friction points this paper is trying to address.

Enhance speech translation with limited data
Leverage shared semantic space for cross-lingual transfer
Improve multilingual translation via fine-tuning Whisper decoder
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-Lingual Transfer Learning
Speech-to-Speech Retrieval
Fine-Tuning Whisper Decoder