🤖 AI Summary
This work describes a system for the Interspeech 2025 ML-SUPERB 2.0 Challenge, covering language identification (LID) and multilingual automatic speech recognition (ASR). LID uses a hybrid design that combines a pretrained language embedding model with a lightweight ASR model whose encoder is shared across languages and whose bigram language models are language-specific. For ASR, exactly one of three models is applied to each language, chosen according to training data availability and performance on held-out data: a fine-tuned SeamlessM4T, MMS-1B-all with custom language adapters, or MMS-zeroshot. The system obtained the top overall score in the challenge.
📝 Abstract
This paper describes the language identification and multilingual speech recognition system developed at Tallinn University of Technology for the Interspeech 2025 ML-SUPERB 2.0 Challenge. Language identification uses a hybrid system consisting of a pretrained language embedding model and a lightweight speech recognition model with an encoder shared across languages and language-specific bigram language models. For speech recognition, three models are used, but only one is applied to each language, depending on training data availability and performance on held-out data. The model set consists of a fine-tuned version of SeamlessM4T, MMS-1B-all with custom language adapters, and MMS-zeroshot. The system obtained the top overall score in the challenge.
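The per-language choice among the three speech recognition models can be illustrated with a minimal sketch. The abstract does not specify the selection procedure, so everything here is an assumption: the function name `select_model`, the candidate labels, the use of a held-out error rate as the criterion, and the fallback to the zero-shot model when no candidate has a score. The numbers in the usage lines are placeholders, not results from the paper.

```python
from typing import Dict, Optional

# Hypothetical labels for the three models named in the abstract.
ZERO_SHOT = "mms_zeroshot"


def select_model(heldout_error: Dict[str, Optional[float]]) -> str:
    """Pick one ASR model for a language from held-out error rates.

    `heldout_error` maps a candidate label to its error rate on held-out
    data, or to None when the candidate could not be trained or evaluated
    for this language (for example, because no training data exists).
    Assumption: lower error is better, and the zero-shot model is the
    fallback when no candidate has a usable score.
    """
    scored = {name: err for name, err in heldout_error.items() if err is not None}
    if not scored:
        return ZERO_SHOT
    # Return the label with the smallest held-out error.
    return min(scored, key=scored.get)


# Placeholder scores for illustration only.
with_data = {
    "seamless_m4t_finetuned": 0.21,
    "mms_1b_all_adapter": 0.18,
    "mms_zeroshot": 0.40,
}
without_data = {
    "seamless_m4t_finetuned": None,
    "mms_1b_all_adapter": None,
    "mms_zeroshot": None,
}

choice_with_data = select_model(with_data)
choice_without_data = select_model(without_data)
```

Under these assumptions, a language with training data gets whichever candidate scores best on held-out data, while a language with no scored candidates is routed to the zero-shot model.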