Efficient Data Selection for Domain Adaptation of ASR Using Pseudo-Labels and Multi-Stage Filtering

📅 2025-06-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the scarcity of labeled data for domain adaptation of automatic speech recognition (ASR) models in resource-constrained settings, this paper proposes a multi-stage pseudo-label filtering framework that jointly leverages word error rate (WER) prediction, named entity recognition (NER), and character error rate (CER) analysis to improve pseudo-label quality and selection robustness. The method generates initial pseudo-labels with Whisper (an encoder-decoder model) and Zipformer (a transducer model), then filters a 7,500-hour customer-service speech corpus down to 100 hours of high-quality utterances, just 1.4% of the data. Fine-tuning on the full pseudo-labeled set yields a WER of 12.3%, and fine-tuning on the compact filtered subset matches that performance. The framework thus offers a scalable, low-cost paradigm for ASR domain adaptation in low-resource, high-noise settings, particularly for small organizations deploying ASR in domains such as customer service.
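The three filtering signals compose naturally into a cascade. Below is a minimal Python sketch of how such a pipeline might look; the `Utterance` fields, threshold values, and the greedy duration budget are illustrative assumptions rather than details from the paper, with CER computed via the jiwer package.

```python
# Minimal sketch of a multi-stage pseudo-label filter (not the authors' code).
# Field names, thresholds, and the greedy duration budget are illustrative
# assumptions; CER is computed with the jiwer package (pip install jiwer).
from dataclasses import dataclass
from jiwer import cer

@dataclass
class Utterance:
    audio_id: str
    duration_s: float
    whisper_hyp: str       # pseudo-label from Whisper (encoder-decoder)
    zipformer_hyp: str     # pseudo-label from Zipformer (transducer)
    predicted_wer: float   # output of a hypothetical WER-prediction model
    entity_count: int      # entities found by a hypothetical NER tagger

def passes_filter(utt, max_pred_wer=0.15, min_entities=1, max_cross_cer=0.10):
    """Keep an utterance only if it clears all three stages."""
    if utt.predicted_wer > max_pred_wer:    # stage 1: WER prediction
        return False
    if utt.entity_count < min_entities:     # stage 2: NER-based content check
        return False
    # Stage 3: cross-system agreement as CER between the two hypotheses.
    return cer(utt.whisper_hyp, utt.zipformer_hyp) <= max_cross_cer

def select_subset(utterances, budget_hours=100.0):
    """Accumulate filtered utterances, most confident first, up to a budget."""
    selected, total_s = [], 0.0
    for utt in sorted(utterances, key=lambda u: u.predicted_wer):
        if total_s >= budget_hours * 3600:
            break
        if passes_filter(utt):
            selected.append(utt)
            total_s += utt.duration_s
    return selected
```

With a 100-hour budget, a cascade of this shape would yield a subset on the order of the 1.4% selection rate reported above.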

📝 Abstract
Fine-tuning pretrained ASR models for specific domains is challenging for small organizations with limited labeled data and computational resources. Here, we explore different data selection pipelines and propose a robust approach that improves ASR adaptation by filtering pseudo-labels generated using Whisper (encoder-decoder) and Zipformer (transducer) models. Our approach integrates multiple selection strategies -- including word error rate (WER) prediction, named entity recognition (NER), and character error rate (CER) analysis -- to extract high-quality training segments. We evaluate our method on Whisper and Zipformer using a 7500-hour baseline, comparing it to a CER-based approach relying on hypotheses from three ASR systems. Fine-tuning on 7500 hours of pseudo-labeled call center data achieves 12.3% WER, while our filtering reduces the dataset to 100 hours (1.4%) with similar performance; a similar trend is observed on Fisher English.
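The CER-based comparison approach scores agreement among hypotheses from three ASR systems. A minimal sketch follows; mean pairwise CER and the 0.05 threshold are illustrative assumptions, not values from the paper.

```python
# Sketch of a CER-based agreement filter over three systems' hypotheses:
# keep an utterance when the hypotheses agree closely. Mean pairwise CER
# and the 0.05 threshold are illustrative assumptions.
from itertools import combinations
from jiwer import cer

def cross_system_cer(hyps):
    """Mean pairwise CER across the systems' hypotheses."""
    pairs = list(combinations(hyps, 2))
    return sum(cer(ref, hyp) for ref, hyp in pairs) / len(pairs)

keep = cross_system_cer([
    "please verify the account number",    # system 1
    "please verify the account number",    # system 2
    "please verify their account number",  # system 3
]) <= 0.05
print(keep)
```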
Problem

Research questions and friction points this paper is trying to address.

Improving ASR adaptation with limited labeled data
Filtering pseudo-labels for high-quality training segments
Reducing dataset size while maintaining performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Whisper (encoder-decoder) and Zipformer (transducer) to generate pseudo-labels (see the sketch after this list)
Multi-stage filtering that combines WER prediction, NER, and CER analysis
Matches full-data performance (12.3% WER) using only 100 hours, 1.4% of the corpus
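
As a concrete illustration of the pseudo-labeling step, the sketch below transcribes audio with the openai-whisper package; the model size and file name are placeholders, and the parallel Zipformer (transducer) pass, typically run via icefall or sherpa, is omitted for brevity.

```python
# Minimal sketch of the pseudo-labeling pass with the openai-whisper package
# (pip install openai-whisper). Model size and file name are placeholders;
# the parallel Zipformer (transducer) pass is omitted for brevity.
import whisper

model = whisper.load_model("small")  # arbitrary size for illustration

def pseudo_label(wav_path: str) -> str:
    """Return Whisper's transcript, to be used as a pseudo-label."""
    return model.transcribe(wav_path)["text"].strip()

# Example (hypothetical file):
# print(pseudo_label("call_center_utt_0001.wav"))
```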
👥 Authors
Pradeep Rangappa (Omilia; postdoc, Idiap Research Institute)
Andrés Carofilis (Idiap Research Institute, Switzerland)
Jeena Prakash (Uniphore Systems, India & USA)
Shashi Kumar (Idiap Research Institute, Switzerland; EPFL, Switzerland)
Sergio Burdisso (Idiap Research Institute)
S. Madikeri (University of Zurich, Switzerland)
Esaú Villatoro-Tello (Idiap Research Institute, Switzerland)
Bidisha Sharma (Uniphore)
Petr Motlicek (Idiap Research Institute)
Kadri Hacioğlu (Uniphore Systems, India & USA)
Shankar Venkatesan (Uniphore Systems, India & USA)
Saurabh Vyas (Uniphore Systems, India & USA)
A. Stolcke (Uniphore Systems, India & USA)