A Domain Adaptation Framework for Speech Recognition Systems with Only Synthetic Data

๐Ÿ“… 2025-01-21
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
To address the degradation of ASR model performance across domains caused by the scarcity of in-domain speech data, this paper proposes DAS, a zero-shot domain adaptation framework for Whisper that requires no real labeled speech and relies solely on synthetic speech for adaptation to target domains such as music, weather, and sports. Methodologically, DAS introduces (i) an LLM-driven pipeline for domain-specific text generation followed by TTS-based speech synthesis, and (ii) a novel one-pass autoregressive decoding mechanism that fuses multiple LoRA adapters, jointly preserving out-of-domain generalization while enhancing domain specificity. Experiments show a 10–17% WER reduction across all target domains, with only a 1% degradation on LibriSpeech (out-of-domain test set); inference RTF increases by just 9% when fusing three LoRAs in parallel in real time. To the authors' knowledge, DAS is the first ASR domain adaptation method to combine strong generalization, low computational overhead, and full reliance on synthetic data, without any real labeled speech.
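The data-generation pipeline described above can be sketched as follows. This is an illustrative outline only: the function names are hypothetical placeholders, and the LLM and TTS calls are stubbed out rather than invoking any real model API.

```python
# Hypothetical sketch of the DAS synthetic-data pipeline:
# LLM generates domain-specific text, TTS converts it to speech,
# and the (audio, transcript) pairs become LoRA fine-tuning data.
# All names and stubs below are illustrative, not the paper's actual API.

def generate_domain_texts(domain, n):
    # Stub for prompting an LLM for domain-specific sentences.
    return [f"a {domain} sentence number {i}" for i in range(n)]

def synthesize_speech(text):
    # Stub for a TTS call; returns a dummy 1-second, 16 kHz waveform.
    return [0.0] * 16000

def build_synthetic_corpus(domain, n):
    """Return (waveform, transcript) pairs for fine-tuning a domain LoRA."""
    texts = generate_domain_texts(domain, n)
    return [(synthesize_speech(t), t) for t in texts]

corpus = build_synthetic_corpus("weather", 3)
```

In the actual system, each domain's corpus is used to fine-tune a separate LoRA adapter on Whisper, leaving the base model weights frozen.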

๐Ÿ“ Abstract
We introduce DAS (Domain Adaptation with Synthetic data), a novel domain adaptation framework for pre-trained ASR models, designed to adapt efficiently to various language-defined domains without requiring any real data. In particular, DAS first prompts large language models (LLMs) to generate domain-specific texts, then converts these texts to speech via text-to-speech technology. The synthetic data is used to fine-tune Whisper with Low-Rank Adapters (LoRAs) for target domains such as music, weather, and sports. We introduce a novel one-pass decoding strategy that efficiently merges predictions from multiple LoRA adapters during the auto-regressive text generation process. Experimental results show significant improvements, reducing the Word Error Rate (WER) by 10% to 17% across all target domains compared to the original model, with minimal performance regression in out-of-domain settings (e.g., -1% on the LibriSpeech test sets). We also demonstrate that DAS operates efficiently during inference, introducing only a 9% increase in Real Time Factor (RTF) over the original model when inferring with three LoRA adapters.
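The one-pass multi-LoRA decoding strategy can be illustrated with a minimal sketch. Assume, for illustration only, that fusion is a per-step average of each adapter's output logits; the paper's exact merge rule may differ, and the toy logits below are made up.

```python
# Minimal sketch of one-pass decoding with several LoRA adapters.
# Assumption (not confirmed by the abstract): fusion is a simple per-step
# average of each adapter's logits, followed by greedy token selection.

def fuse_logits(per_adapter_logits):
    """Average the logits each adapter produced for the current step."""
    n = len(per_adapter_logits)
    vocab = len(per_adapter_logits[0])
    return [sum(l[i] for l in per_adapter_logits) / n for i in range(vocab)]

def greedy_step(fused):
    """Pick the token with the highest fused logit."""
    return max(range(len(fused)), key=lambda i: fused[i])

# Toy example: three domain adapters emit logits over a 4-token vocabulary.
step_logits = [
    [0.1, 2.0, 0.3, 0.0],  # music adapter
    [0.2, 1.5, 0.1, 0.0],  # weather adapter
    [0.0, 1.8, 0.4, 0.1],  # sports adapter
]
fused = fuse_logits(step_logits)
token = greedy_step(fused)  # token 1 has the highest average logit
```

Because all adapters share the same frozen Whisper backbone, such a merge needs only one decoding pass per step, which is consistent with the reported 9% RTF overhead for three adapters.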
Problem

Research questions and friction points this paper is trying to address.

Speech Recognition
Topic Adaptation
Data Scarcity
Innovation

Methods, ideas, or system contributions that make the work stand out.

DAS
Synthetic Speech Data
Efficient Decoding Method
๐Ÿ”Ž Similar Papers
No similar papers found.