MSDA: Combining Pseudo-labeling and Self-Supervision for Unsupervised Domain Adaptation in ASR

📅 2025-05-30
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Automatic speech recognition (ASR) models exhibit poor cross-domain robustness and suffer from severe data scarcity for low-resource languages (e.g., Greek) under weak supervision. Method: We propose MSDA, a sample-efficient, two-stage domain adaptation framework. Stage I performs coarse-grained domain alignment leveraging wav2vec 2.0 self-supervised representations; Stage II refines the adaptation via consistency-regularized pseudo-labeling, curriculum-based sample selection, and progressive fine-tuning. Contribution/Results: MSDA is the first to systematically demonstrate that self-supervised pretraining and self-training must be decoupled into a staged, synergistic design rather than fused end-to-end. Evaluated on multiple cross-domain ASR benchmarks, MSDA achieves state-of-the-art performance, reducing average word error rate by 18.7% over the best baseline, and it exhibits significantly improved stability under noisy conditions and extremely low annotation budgets.
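To make the Stage II step concrete, below is a minimal sketch of confidence-filtered pseudo-labeling with a wav2vec 2.0 CTC model from HuggingFace transformers. The checkpoint, the 0.9 confidence threshold, and the utterance-level filtering rule are illustrative assumptions, not the paper's actual settings.

```python
# Illustrative sketch only: confidence-filtered pseudo-labeling with a
# wav2vec 2.0 CTC model. Checkpoint and threshold are stand-ins, not the
# paper's configuration.
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

MODEL_ID = "facebook/wav2vec2-base-960h"  # assumed checkpoint for the demo
processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
model.eval()

@torch.no_grad()
def pseudo_label(batch_audio, sample_rate=16_000, min_confidence=0.9):
    """Decode unlabeled audio and keep only hypotheses whose mean per-frame
    probability clears a (hypothetical) confidence threshold."""
    inputs = processor(batch_audio, sampling_rate=sample_rate,
                       return_tensors="pt", padding=True)
    logits = model(**inputs).logits             # (batch, frames, vocab)
    probs = logits.softmax(dim=-1)
    conf, ids = probs.max(dim=-1)               # greedy per-frame argmax
    texts = processor.batch_decode(ids)         # CTC-collapsed transcripts
    keep = conf.mean(dim=-1) >= min_confidence  # utterance-level filter
    return [(text, bool(flag)) for text, flag in zip(texts, keep)]
```

In the cascaded design, this filtering would run only after Stage I has domain-aligned the encoder, so the pseudo-labels are generated by an already-adapted model.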

📝 Abstract
In this work, we investigate the Meta PL unsupervised domain adaptation framework for Automatic Speech Recognition (ASR). We introduce a Multi-Stage Domain Adaptation pipeline (MSDA), a sample-efficient, two-stage adaptation approach that integrates self-supervised learning with semi-supervised techniques. MSDA is designed to enhance the robustness and generalization of ASR models, making them more adaptable to diverse conditions. It is particularly effective for low-resource languages like Greek and in weakly supervised scenarios where labeled data is scarce or noisy. Through extensive experiments, we demonstrate that Meta PL can be applied effectively to ASR tasks, achieving state-of-the-art results that significantly outperform prior methods and providing more robust solutions for unsupervised domain adaptation in ASR. Our ablations highlight the necessity of a cascading approach when combining self-supervision with self-training.
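Meta PL here follows the Meta Pseudo Labels idea (Pham et al., 2021): a student trains on a teacher's pseudo-labels, and the teacher is then updated according to how much that step improved the student on labeled data. Below is a toy first-order sketch; the linear stand-in models and the REINFORCE-style surrogate for the meta-gradient are simplifications for illustration, not the paper's implementation.

```python
# Toy, first-order illustration of a Meta Pseudo Labels-style update
# (after Pham et al., 2021). Linear models and the reward surrogate are
# simplifications, not the paper's ASR setup.
import torch
from torch import nn
from torch.nn.functional import cross_entropy

teacher, student = nn.Linear(16, 8), nn.Linear(16, 8)
t_opt = torch.optim.SGD(teacher.parameters(), lr=0.1)
s_opt = torch.optim.SGD(student.parameters(), lr=0.1)

x_u = torch.randn(4, 16)                                   # unlabeled batch
x_l, y_l = torch.randn(4, 16), torch.randint(0, 8, (4,))   # small labeled batch

# 1) Student takes a step on the teacher's pseudo-labels.
pseudo = teacher(x_u).argmax(dim=-1).detach()
loss_before = cross_entropy(student(x_l), y_l).item()
s_opt.zero_grad()
cross_entropy(student(x_u), pseudo).backward()
s_opt.step()

# 2) Teacher is rewarded by how much that step helped the student on
#    real labels (a crude stand-in for the exact meta-gradient).
reward = loss_before - cross_entropy(student(x_l), y_l).item()
t_opt.zero_grad()
(-reward * cross_entropy(teacher(x_u), pseudo)).backward()
t_opt.step()
```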
Problem

Research questions and friction points this paper is trying to address.

Enhancing ASR robustness via unsupervised domain adaptation
Adapting ASR models to low-resource languages like Greek
Combining self-supervision with semi-supervised learning for ASR
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combining pseudo-labeling and self-supervision for ASR
Two-stage adaptation with self-supervised learning
Cascading approach for robust domain adaptation (contrasted with joint training in the sketch below)
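The ablation's point, that self-supervision and self-training work better cascaded than mixed into a single objective, can be made concrete with a toy sketch; the losses and the tiny model below are placeholders, not ASR components.

```python
# Toy contrast between joint and cascaded training. The two losses stand in
# for the masked contrastive SSL objective and the CTC pseudo-label objective.
import torch
from torch import nn

encoder, head = nn.Linear(16, 16), nn.Linear(16, 8)
opt = torch.optim.Adam([*encoder.parameters(), *head.parameters()], lr=1e-3)

def ssl_loss(x):                      # placeholder for the contrastive SSL loss
    return encoder(x).pow(2).mean()

def pseudo_label_loss(x, y):          # placeholder for CTC on pseudo-labels
    return nn.functional.cross_entropy(head(encoder(x)), y)

x, y = torch.randn(4, 16), torch.randint(0, 8, (4,))

# Joint: one mixed objective per step (what the ablations argue against).
opt.zero_grad()
(ssl_loss(x) + pseudo_label_loss(x, y)).backward()
opt.step()

# Cascaded: Stage I aligns the encoder first; Stage II self-trains on top.
opt.zero_grad(); ssl_loss(x).backward(); opt.step()               # Stage I
opt.zero_grad(); pseudo_label_loss(x, y).backward(); opt.step()   # Stage II
```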
Authors

Dimitrios Damianos
Speech and Language Processing Group, National Technical University of Athens, Greece; Institute for Language and Speech Processing, Athena Research Center, Greece

Georgios Paraskevopoulos
Associate Researcher, Institute for Speech and Language Processing, Athena RC
Multimodal Processing · Deep Learning · NLP · Domain adaptation

A. Potamianos
Speech and Language Processing Group, National Technical University of Athens, Greece