WhisTLE: Deeply Supervised, Text-Only Domain Adaptation for Pretrained Speech Recognition Transformers

📅 2025-09-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of domain adaptation for pretrained ASR models (e.g., Whisper) in the absence of target-domain speech data, this paper proposes WhisTLE, a text-only domain adaptation method that requires no target-domain audio. The core innovation is a deeply supervised variational autoencoder (VAE) that explicitly models the mapping from text to the ASR encoder's latent space; the decoder is fine-tuned against these learned latents, and the original encoder is restored at inference, so the inference pipeline is unchanged and incurs zero runtime overhead. Additionally, WhisTLE can be combined with text-to-speech (TTS)-generated pseudo-speech to further improve adaptation. Experiments across four cross-domain datasets and four ASR architectures show that WhisTLE with TTS reduces word error rate by 12.3% relative to a TTS-only baseline and outperforms all non-WhisTLE baselines in 27 of 32 evaluated scenarios.

📝 Abstract
Pretrained automatic speech recognition (ASR) models such as Whisper perform well but still need domain adaptation to handle unseen vocabulary and parlance. In many real-world settings, collecting speech data is impractical, necessitating text-only adaptation. We propose WhisTLE, a deeply supervised, text-only adaptation method for pretrained encoder-decoder ASR models. WhisTLE trains a variational autoencoder (VAE) to model encoder outputs from text and fine-tunes the decoder using the learned text-to-latent encoder, optionally combined with text-to-speech (TTS) adaptation. At inference, the original encoder is restored, incurring no extra runtime cost. Across four out-of-domain datasets and four ASR models, WhisTLE with TTS reduces word error rate (WER) by 12.3% relative to TTS-only adaptation and outperforms all non-WhisTLE baselines in 27 of 32 scenarios.
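The text-to-latent idea in the abstract can be sketched in a few lines: a small model maps text embeddings to a distribution over the ASR encoder's output space, and samples from that distribution stand in for encoder outputs when fine-tuning the decoder. The sketch below is illustrative only, assuming a linear projection and toy dimensions; the function and weight names are hypothetical and not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def text_to_latent_vae(token_embeddings, w_mu, w_logvar):
    """Hypothetical text-to-latent encoder (illustrative, not WhisTLE's actual
    architecture): predicts a Gaussian over the ASR encoder's latent space."""
    mu = token_embeddings @ w_mu          # per-token latent mean
    logvar = token_embeddings @ w_logvar  # per-token latent log-variance
    # Reparameterization trick: sampled latents substitute for encoder outputs
    # when fine-tuning the decoder from text alone.
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps
    # KL term regularizes the predicted distribution toward a standard normal.
    kl = -0.5 * np.mean(1.0 + logvar - mu**2 - np.exp(logvar))
    return z, kl

# Toy shapes: 10 text tokens, 16-dim embeddings, 8-dim encoder latent space.
tokens = rng.standard_normal((10, 16))
w_mu = rng.standard_normal((16, 8)) * 0.1
w_logvar = rng.standard_normal((16, 8)) * 0.1
z, kl = text_to_latent_vae(tokens, w_mu, w_logvar)
```

Because the surrogate latents are only used during training, the real speech encoder can be swapped back in at inference, which is why the method adds no runtime cost.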
Problem

Research questions and friction points this paper is trying to address.

Adapting speech recognition models with text-only data
Handling unseen vocabulary and domain-specific parlance
Avoiding costly speech data collection for domain adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text-only adaptation via VAE
Deep supervision for decoder fine-tuning
No runtime cost with original encoder