🤖 AI Summary
To address the scarcity of labeled data for domain adaptation of automatic speech recognition (ASR) models in resource-constrained settings, this paper proposes a multi-stage pseudo-label filtering framework that jointly leverages word error rate (WER) prediction, named entity recognition (NER), and character error rate (CER) analysis to improve pseudo-label quality and selection robustness. The method generates initial pseudo-labels with Whisper (an encoder-decoder model) and Zipformer (a transducer model), then filters out 100 hours of high-quality utterances, just 1.4% of a 7,500-hour customer-service speech corpus. Fine-tuning an ASR model on this compact subset achieves a WER of 12.3%, matching the performance of training on the full pseudo-labeled set. This framework establishes a scalable, low-cost paradigm for efficient ASR domain adaptation in low-resource, high-noise settings, and is particularly well suited to small organizations deploying ASR in domains such as customer service.
📝 Abstract
Fine-tuning pretrained ASR models for specific domains is challenging for small organizations with limited labeled data and computational resources. Here, we explore different data selection pipelines and propose a robust approach that improves ASR adaptation by filtering pseudo-labels generated using Whisper (encoder-decoder) and Zipformer (transducer) models. Our approach integrates multiple selection strategies -- including word error rate (WER) prediction, named entity recognition (NER), and character error rate (CER) analysis -- to extract high-quality training segments. We evaluate our method on Whisper and Zipformer using a 7500-hour baseline, comparing it to a CER-based approach relying on hypotheses from three ASR systems. Fine-tuning on 7500 hours of pseudo-labeled call center data achieves 12.3% WER, while our filtering reduces the dataset to 100 hours (1.4%) with similar performance; a similar trend is observed on Fisher English.
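The abstract's selection pipeline can be pictured as a conjunction of three per-utterance signals: cross-system CER agreement between the Whisper and Zipformer hypotheses, a predicted WER threshold, and NER consistency. The sketch below is a minimal illustration of that idea, not the paper's implementation; the thresholds, function names, and the assumption that entity lists are precomputed are all hypothetical.

```python
# Hypothetical sketch of a multi-signal pseudo-label filter in the spirit of
# the abstract: keep an utterance only if (1) the two ASR systems agree at the
# character level, (2) a WER predictor rates it clean, and (3) the named
# entities extracted from both hypotheses match. Thresholds are illustrative.

def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # rolling row of the edit-distance table
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                       # deletion
                        dp[j - 1] + 1,                   # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n] / max(m, 1)

def keep_segment(whisper_hyp: str, zipformer_hyp: str, predicted_wer: float,
                 whisper_entities: list, zipformer_entities: list,
                 cer_max: float = 0.10, wer_max: float = 0.15) -> bool:
    """Accept a pseudo-labeled segment only if all three signals agree."""
    if cer(whisper_hyp, zipformer_hyp) > cer_max:  # cross-system disagreement
        return False
    if predicted_wer > wer_max:                    # WER-prediction filter
        return False
    return set(whisper_entities) == set(zipformer_entities)  # NER consistency
```

Under this scheme only a small, high-confidence slice of the corpus survives all three checks, which is consistent with the paper's report that roughly 1.4% of the 7,500 hours suffices for fine-tuning.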