A Domain Adaptation Framework for Speech Recognition Systems with Only Synthetic Data

๐Ÿ“… 2025-01-21
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
To address the degradation of ASR model performance across domains caused by the scarcity of in-domain speech data, this paper proposes DAS, a zero-shot domain adaptation framework for Whisper that requires no real labeled speech and relies solely on synthetic speech for adaptation to target domains such as music, weather, and sports. Methodologically, DAS introduces (i) an LLM-driven pipeline for domain-specific text generation followed by TTS-based speech synthesis, and (ii) a novel one-pass autoregressive decoding mechanism that fuses multiple LoRA adapters, jointly preserving out-of-domain generalization while enhancing domain specificity. Experiments show a 10–17% WER reduction across all target domains, with only a 1% degradation on LibriSpeech (out-of-domain test set); inference RTF increases by just 9% when fusing three LoRAs in parallel in real time. To the authors' knowledge, DAS is the first ASR domain adaptation method to combine strong generalization, low computational overhead, and full reliance on synthetic data, without any real labeled speech.
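The data-generation pipeline described above can be sketched as follows. This is an illustrative outline only: the function names are hypothetical placeholders, and the LLM and TTS calls are stubbed out rather than invoking any real model API.

```python
# Hypothetical sketch of the DAS synthetic-data pipeline:
# LLM generates domain-specific text, TTS converts it to speech,
# and the (audio, transcript) pairs become LoRA fine-tuning data.
# All names and stubs below are illustrative, not the paper's actual API.

def generate_domain_texts(domain, n):
    # Stub for prompting an LLM for domain-specific sentences.
    return [f"a {domain} sentence number {i}" for i in range(n)]

def synthesize_speech(text):
    # Stub for a TTS call; returns a dummy 1-second, 16 kHz waveform.
    return [0.0] * 16000

def build_synthetic_corpus(domain, n):
    """Return (waveform, transcript) pairs for fine-tuning a domain LoRA."""
    texts = generate_domain_texts(domain, n)
    return [(synthesize_speech(t), t) for t in texts]

corpus = build_synthetic_corpus("weather", 3)
```

In the actual system, each domain's corpus is used to fine-tune a separate LoRA adapter on Whisper, leaving the base model weights frozen.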

๐Ÿ“ Abstract
We introduce DAS (Domain Adaptation with Synthetic data), a novel domain adaptation framework for pre-trained ASR models, designed to adapt efficiently to various language-defined domains without requiring any real data. In particular, DAS first prompts large language models (LLMs) to generate domain-specific texts, then converts these texts to speech via text-to-speech technology. The synthetic data is used to fine-tune Whisper with Low-Rank Adapters (LoRAs) for target domains such as music, weather, and sports. We introduce a novel one-pass decoding strategy that efficiently merges predictions from multiple LoRA adapters during the auto-regressive text generation process. Experimental results show significant improvements, reducing the Word Error Rate (WER) by 10% to 17% across all target domains compared to the original model, with minimal performance regression in out-of-domain settings (e.g., -1% on the LibriSpeech test sets). We also demonstrate that DAS operates efficiently during inference, introducing only a 9% increase in Real Time Factor (RTF) over the original model when inferring with three LoRA adapters.
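The one-pass multi-LoRA decoding strategy can be illustrated with a minimal sketch. Assume, for illustration only, that fusion is a per-step average of each adapter's output logits; the paper's exact merge rule may differ, and the toy logits below are made up.

```python
# Minimal sketch of one-pass decoding with several LoRA adapters.
# Assumption (not confirmed by the abstract): fusion is a simple per-step
# average of each adapter's logits, followed by greedy token selection.

def fuse_logits(per_adapter_logits):
    """Average the logits each adapter produced for the current step."""
    n = len(per_adapter_logits)
    vocab = len(per_adapter_logits[0])
    return [sum(l[i] for l in per_adapter_logits) / n for i in range(vocab)]

def greedy_step(fused):
    """Pick the token with the highest fused logit."""
    return max(range(len(fused)), key=lambda i: fused[i])

# Toy example: three domain adapters emit logits over a 4-token vocabulary.
step_logits = [
    [0.1, 2.0, 0.3, 0.0],  # music adapter
    [0.2, 1.5, 0.1, 0.0],  # weather adapter
    [0.0, 1.8, 0.4, 0.1],  # sports adapter
]
fused = fuse_logits(step_logits)
token = greedy_step(fused)  # token 1 has the highest average logit
```

Because all adapters share the same frozen Whisper backbone, such a merge needs only one decoding pass per step, which is consistent with the reported 9% RTF overhead for three adapters.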
Problem

Research questions and friction points this paper is trying to address.

Speech Recognition
Topic Adaptation
Data Scarcity
Innovation

Methods, ideas, or system contributions that make the work stand out.

DAS
Synthetic Speech Data
Efficient Decoding Method
๐Ÿ”Ž Similar Papers
No similar papers found.