Better Semi-supervised Learning for Multi-domain ASR Through Incremental Retraining and Data Filtering

📅 2025-06-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the scarcity of in-domain labeled data in multi-domain automatic speech recognition (ASR), this paper proposes an incremental semi-supervised adaptation method. First, it initializes fine-tuning by jointly leveraging a small amount of in-domain labeled data and auxiliary data from semantically proximal source domains. Subsequently, it introduces a dual-path pseudo-label selection mechanism—integrating multi-model consensus and named entity recognition (NER)—to enable dynamic, high-confidence pseudo-label generation and iterative refinement. This work establishes the first incremental semi-supervised ASR training paradigm, effectively mitigating performance saturation while balancing accuracy and efficiency. Evaluated on the Wow and Fisher multi-domain benchmarks, the method achieves up to 22.3% (Wow) and 24.8% (Fisher) relative word error rate (WER) reduction over a single-step random pseudo-labeling baseline, demonstrating substantial improvements in cross-domain generalization.

📝 Abstract
Fine-tuning pretrained ASR models for specific domains is challenging when labeled data is scarce, but unlabeled audio and labeled data from related domains are often available. We propose an incremental semi-supervised learning pipeline that first integrates a small in-domain labeled set and an auxiliary dataset from a closely related domain, achieving a relative improvement of 4% over using no auxiliary data. Filtering based on multi-model consensus or named entity recognition (NER) is then applied to select and iteratively refine pseudo-labels, showing slower performance saturation than random selection. Evaluated on the multi-domain Wow call center and Fisher English corpora, the pipeline outperforms single-step fine-tuning. Consensus-based filtering outperforms the other methods, providing up to 22.3% relative improvement on Wow and 24.8% on Fisher over single-step fine-tuning with random selection. NER is the second-best filter, providing competitive performance at lower computational cost.
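The multi-model consensus idea in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`wer`, `consensus_filter`), the pairwise-WER agreement rule, and the threshold value are all assumptions; the paper does not specify its exact consensus criterion here.

```python
# Hypothetical sketch: keep an utterance's pseudo-label only when all
# ASR models transcribe it nearly identically (pairwise WER <= threshold).

def wer(ref, hyp):
    """Word error rate between two transcripts via token edit distance."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[-1][-1] / max(len(r), 1)

def consensus_filter(hypotheses, threshold=0.1):
    """Select high-confidence pseudo-labels by multi-model agreement.

    hypotheses: dict mapping utterance id -> list of transcripts, one per model.
    Returns dict mapping utterance id -> chosen pseudo-label (first model's).
    """
    selected = {}
    for utt, hyps in hypotheses.items():
        pairs = [(a, b) for i, a in enumerate(hyps) for b in hyps[i + 1:]]
        if all(wer(a, b) <= threshold for a, b in pairs):
            selected[utt] = hyps[0]
    return selected

# Toy example: three models agree on utt1 but diverge on utt2.
hyps = {
    "utt1": ["call the support line"] * 3,
    "utt2": ["reset my password", "recent my passport", "reset my password"],
}
print(consensus_filter(hyps))  # only utt1 survives
```

In an actual iterative round, the surviving pseudo-labels would be added to the training set and the models retrained before the next selection pass.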
Problem

Research questions and friction points this paper is trying to address.

Improving multi-domain ASR with scarce labeled data
Leveraging unlabeled and related-domain data for semi-supervised learning
Enhancing pseudo-label selection via consensus and NER filtering
Innovation

Methods, ideas, or system contributions that make the work stand out.

Incremental semi-supervised learning pipeline
Multi-model consensus filtering for pseudo-labels
NER-based filtering for computational efficiency
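The NER-based filter can be sketched in the same spirit. Everything here is an assumption for illustration: the paper does not state its NER model or decision rule, the `extract_entities` function is a toy stand-in for a real tagger, and the in-domain entity inventory is invented.

```python
# Hypothetical sketch: accept a pseudo-label only if every entity it
# contains appears in a known in-domain entity inventory (labels with no
# entities pass by default). Cheaper than consensus filtering because it
# needs only one ASR pass plus a tagging pass.

IN_DOMAIN_ENTITIES = {"fisher", "boston"}  # illustrative inventory

def extract_entities(text):
    """Toy stand-in for an NER tagger: flags capitalized tokens.
    A real pipeline would use an off-the-shelf NER model instead."""
    return {tok.lower().strip(".,") for tok in text.split() if tok[:1].isupper()}

def ner_filter(pseudo_labels):
    """Keep pseudo-labels whose entities are all covered by the inventory."""
    kept = {}
    for utt, label in pseudo_labels.items():
        if extract_entities(label) <= IN_DOMAIN_ENTITIES:
            kept[utt] = label
    return kept

labels = {
    "utt1": "transfer me to the Fisher desk",   # known entity -> kept
    "utt2": "transfer me to the Zendar desk",   # unknown entity -> dropped
    "utt3": "thanks for calling today",         # no entities -> kept
}
print(ner_filter(labels))
```

The design intuition matches the abstract's efficiency claim: entity checks are a lightweight proxy for transcript reliability, trading some accuracy against the cost of running multiple models.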