🤖 AI Summary
To adapt pretrained ASR models such as Whisper to new domains when no target-domain speech is available, this paper proposes WhisTLE, a text-only domain adaptation method. Its core idea is a deeply supervised variational autoencoder (VAE) that models the mapping from text to the ASR encoder's latent space; the learned text-to-latent encoder is then used to fine-tune the decoder, and the original encoder is restored at inference, so the method adds zero runtime overhead. WhisTLE can additionally be combined with text-to-speech (TTS)-generated pseudo-speech for stronger adaptation. Across four out-of-domain datasets and four ASR architectures, WhisTLE with TTS reduces word error rate by 12.3% relative to a TTS-only baseline and outperforms existing methods in 27 of 32 evaluated scenarios.
📝 Abstract
Pretrained automatic speech recognition (ASR) models such as Whisper perform well but still need domain adaptation to handle unseen vocabulary and parlance. In many real-world settings, collecting speech data is impractical, necessitating text-only adaptation. We propose WhisTLE, a deeply supervised, text-only adaptation method for pretrained encoder-decoder ASR models. WhisTLE trains a variational autoencoder (VAE) to model encoder outputs from text and fine-tunes the decoder using the learned text-to-latent encoder, optionally combined with text-to-speech (TTS) adaptation. At inference, the original encoder is restored, incurring no extra runtime cost. Across four out-of-domain datasets and four ASR models, WhisTLE with TTS reduces word error rate (WER) by 12.3% relative to TTS-only adaptation and outperforms all non-WhisTLE baselines in 27 of 32 scenarios.
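The two-stage recipe described above (train a VAE mapping text to encoder latents, then fine-tune the decoder on those latents) can be sketched in PyTorch. Everything below is an illustrative assumption: the toy sizes, the token-aligned latent sequence, and the tiny networks stand in for the paper's actual architecture, which is not specified here.

```python
# Hypothetical sketch of the WhisTLE idea, NOT the paper's implementation.
# Assumptions: latents are token-aligned, toy dimensions, MSE reconstruction.
import torch
import torch.nn as nn

VOCAB, D_TEXT, D_LAT = 100, 32, 16   # toy sizes (assumed)

class TextToLatentVAE(nn.Module):
    """Maps token ids to a distribution over the ASR encoder's latent space."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_TEXT)
        self.mu = nn.Linear(D_TEXT, D_LAT)
        self.logvar = nn.Linear(D_TEXT, D_LAT)

    def forward(self, tokens):                     # tokens: (B, T)
        h = self.embed(tokens)                     # (B, T, D_TEXT)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample z while keeping gradients.
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        return z, mu, logvar

def vae_step(vae, tokens, encoder_out):
    """Stage 1: supervise sampled latents against real ASR encoder outputs."""
    z, mu, logvar = vae(tokens)
    recon = ((z - encoder_out) ** 2).mean()                      # reconstruction
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()   # KL to N(0, I)
    return recon + kl

# Stage 2 (not shown): feed z from the trained VAE into the ASR decoder in
# place of real encoder states and fine-tune the decoder on text-only data.
# At inference the original speech encoder is restored, so runtime is unchanged.

tokens = torch.randint(0, VOCAB, (2, 8))          # dummy text batch
fake_encoder_out = torch.randn(2, 8, D_LAT)       # stand-in encoder states
loss = vae_step(TextToLatentVAE(), tokens, fake_encoder_out)
```

The key design point the sketch illustrates is that only the decoder ever sees VAE-produced latents during adaptation; the speech encoder is untouched, which is why inference cost does not change.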