WhisTLE: Deeply Supervised, Text-Only Domain Adaptation for Pretrained Speech Recognition Transformers

📅 2025-09-12

🤖 AI Summary
To adapt pretrained ASR models (e.g., Whisper) to a new domain when no target-domain speech is available, this paper proposes WhisTLE, a domain adaptation method that requires only text. Its core component is a deeply supervised variational autoencoder (VAE) that models the ASR encoder's outputs from text; the decoder is then fine-tuned using this learned text-to-latent encoder. At inference the original encoder is restored, so the inference pipeline is unchanged and there is no extra runtime cost. WhisTLE can optionally be combined with text-to-speech (TTS) adaptation. Across four out-of-domain datasets and four ASR models, WhisTLE with TTS reduces word error rate (WER) by 12.3% relative to TTS-only adaptation and outperforms all non-WhisTLE baselines in 27 of 32 scenarios.

📝 Abstract
Pretrained automatic speech recognition (ASR) models such as Whisper perform well but still need domain adaptation to handle unseen vocabulary and parlance. In many real-world settings, collecting speech data is impractical, necessitating text-only adaptation. We propose WhisTLE, a deeply supervised, text-only adaptation method for pretrained encoder-decoder ASR models. WhisTLE trains a variational autoencoder (VAE) to model encoder outputs from text and fine-tunes the decoder using the learned text-to-latent encoder, optionally combined with text-to-speech (TTS) adaptation. At inference, the original encoder is restored, incurring no extra runtime cost. Across four out-of-domain datasets and four ASR models, WhisTLE with TTS reduces word error rate (WER) by 12.3% relative to TTS-only adaptation and outperforms all non-WhisTLE baselines in 27 of 32 scenarios.
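The headline result is a relative WER reduction, which is easy to misread as an absolute one. The following is a minimal sketch of how WER and a relative reduction are commonly computed; the function names and the example WER values (10.0 and 8.77) are hypothetical illustrations, not figures or code from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Fraction of the baseline WER that the adapted system removes."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - adapted_wer) / baseline_wer


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")

# Hypothetical numbers: a baseline at 10.0% WER improved to 8.77% WER
# corresponds to a 12.3% relative reduction (1.23 points absolute).
example_reduction = relative_wer_reduction(10.0, 8.77)
```

Under this definition, a 12.3% relative reduction means the adapted system removes 12.3% of the baseline's errors, not that WER drops by 12.3 percentage points.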
Problem

Research questions and friction points this paper is trying to address.

Adapting speech recognition models with text-only data
Handling unseen vocabulary and domain-specific parlance
Avoiding costly speech data collection for domain adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text-only adaptation via VAE
Deep supervision for decoder fine-tuning
No extra runtime cost: the original encoder is restored at inference
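The idea of a VAE that models encoder outputs from text can be illustrated with a generic VAE objective. This is a sketch under stated assumptions, not the paper's implementation: the linear maps, the dimensions, the `beta` weight, the mean-squared-error reconstruction term, and the one-text-feature-row-per-encoder-frame alignment are all hypothetical simplifications, and this page does not specify where the deep supervision is applied.

```python
import numpy as np


def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps drawn from a standard normal."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps


def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, sigma^2) || N(0, I)), summed over the last axis, averaged over the rest."""
    kl = -0.5 * (1.0 + logvar - mu ** 2 - np.exp(logvar))
    return float(kl.sum(axis=-1).mean())


def text_to_latent_loss(predicted, target, mu, logvar, beta=1.0):
    """Reconstruction error against the target encoder outputs plus a weighted KL term."""
    reconstruction = float(np.mean((predicted - target) ** 2))
    return reconstruction + beta * kl_to_standard_normal(mu, logvar)


# Toy data (all hypothetical): text features stand in for embedded text, and
# target_encoder_out stands in for outputs of the frozen ASR encoder.
rng = np.random.default_rng(0)
frames, text_dim, latent_dim, enc_dim = 5, 8, 4, 16
text_features = rng.standard_normal((frames, text_dim))
target_encoder_out = rng.standard_normal((frames, enc_dim))

# Randomly initialised linear maps in place of real encoder/decoder networks.
w_mu = 0.1 * rng.standard_normal((text_dim, latent_dim))
w_logvar = 0.1 * rng.standard_normal((text_dim, latent_dim))
w_dec = 0.1 * rng.standard_normal((latent_dim, enc_dim))

mu = text_features @ w_mu
logvar = text_features @ w_logvar
z = reparameterize(mu, logvar, rng)
predicted = z @ w_dec  # predicted encoder outputs from text alone

loss = text_to_latent_loss(predicted, target_encoder_out, mu, logvar, beta=0.1)
```

In a setup like the one the abstract describes, a model trained with such an objective would supply encoder-like features from text while the decoder is fine-tuned; the original speech encoder is then put back for inference, which is why no extra runtime cost is incurred.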