KIT's Low-resource Speech Translation Systems for IWSLT2025: System Enhancement with Synthetic Data and Model Regularization

📅 2025-05-26
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study addresses low-resource speech-to-text translation (ST) from Bemba, North Levantine Arabic, and Tunisian Arabic into English, where genuine parallel ST data is scarce and, for North Levantine Arabic, unavailable. To overcome this bottleneck, the authors propose four techniques: (1) TTS-based synthetic speech generation for ASR and end-to-end ST training; (2) machine translation (MT)-augmented ST modeling, which generates translations from ASR data to improve translation fidelity; (3) intra-distillation to regularize the ASR, MT, and ST models; and (4) Minimum Bayes Risk (MBR) decoding to fuse the cascaded and end-to-end systems. Experiments show that an end-to-end ST system for North Levantine Arabic trained solely on synthetic data slightly surpasses a cascaded system trained on real data. Synthetic speech significantly boosts ASR and ST performance for Bemba, intra-distillation yields consistent gains across all three language pairs, and MBR fusion delivers an average improvement of approximately 1.5 BLEU points.
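As a rough illustration of the MBR fusion step (not the authors' implementation, which would score system hypotheses with a standard translation metric such as BLEU), here is a minimal sketch that pools candidates from the cascaded and end-to-end systems and picks the one with the highest expected utility against the rest, using a simple token-overlap F1 as a stand-in utility:

```python
from collections import Counter


def utility(hyp, ref):
    """Token-level F1 overlap: a cheap stand-in for a metric like BLEU."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())  # multiset intersection of tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def mbr_decode(hypotheses):
    """Return the hypothesis with the highest expected utility against
    all other candidates, assuming a uniform posterior over candidates."""
    def expected_utility(h):
        return sum(utility(h, other) for other in hypotheses if other is not h)
    return max(hypotheses, key=expected_utility)


# Hypothetical candidates pooled from the cascaded and end-to-end systems:
candidates = [
    "the boy goes to school every day",
    "the boy goes to the school daily",
    "a child walks to school each day",
]
print(mbr_decode(candidates))  # → "the boy goes to school every day"
```

The first candidate wins because it shares the most tokens, on average, with the other two; MBR thus favors a "consensus" translation rather than any single system's top output.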

πŸ“ Abstract
This paper presents KIT's submissions to the IWSLT 2025 low-resource track. We develop both cascaded systems, consisting of Automatic Speech Recognition (ASR) and Machine Translation (MT) models, and end-to-end (E2E) Speech Translation (ST) systems for three language pairs: Bemba, North Levantine Arabic, and Tunisian Arabic into English. Building upon pre-trained models, we fine-tune our systems with different strategies to utilize resources efficiently. This study further explores system enhancement with synthetic data and model regularization. Specifically, we investigate MT-augmented ST by generating translations from ASR data using MT models. For North Levantine, which lacks parallel ST training data, a system trained solely on synthetic data slightly surpasses the cascaded system trained on real data. We also explore augmentation using text-to-speech models by generating synthetic speech from MT data, demonstrating the benefits of synthetic data in improving both ASR and ST performance for Bemba. Additionally, we apply intra-distillation to enhance model performance. Our experiments show that this approach consistently improves results across ASR, MT, and ST tasks, as well as across different pre-trained models. Finally, we apply Minimum Bayes Risk decoding to combine the cascaded and end-to-end systems, achieving an improvement of approximately 1.5 BLEU points.
Problem

Research questions and friction points this paper is trying to address.

Enhancing low-resource speech translation with synthetic data
Improving ASR and ST performance via model regularization
Combining cascaded and end-to-end systems for better BLEU scores
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generating synthetic translations with MT models and synthetic speech with TTS models
Applying intra-distillation to enhance model performance
Combining cascaded and E2E systems with MBR decoding
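Intra-distillation regularizes a model by penalizing disagreement between multiple stochastic forward passes of the same network. A minimal sketch of one such loss, assuming two dropout passes and a symmetric KL penalty (the paper's exact divergence, number of passes, and weighting may differ):

```python
import math


def kl(p, q):
    """KL divergence between two discrete distributions.
    Assumes q has full support wherever p is nonzero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def intra_distillation_loss(ce_loss, p1, p2, alpha=1.0):
    """Total loss = cross-entropy + alpha * symmetric KL between the
    output distributions p1, p2 of two stochastic (dropout) passes."""
    return ce_loss + alpha * 0.5 * (kl(p1, p2) + kl(p2, p1))


# Identical passes add no penalty; divergent passes are penalized.
print(intra_distillation_loss(1.2, [0.5, 0.5], [0.5, 0.5]))  # → 1.2
print(intra_distillation_loss(1.2, [0.9, 0.1], [0.5, 0.5]) > 1.2)  # → True
```

Because dropout samples a different sub-network on each pass, minimizing this consistency term pushes all sub-networks toward agreeing predictions, which acts as a regularizer across the ASR, MT, and ST tasks.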