🤖 AI Summary
To adapt pretrained ASR models such as Whisper to new domains when no target-domain speech is available, this paper proposes WhisTLE, a text-only domain adaptation method. Its core idea is a deeply supervised variational autoencoder (VAE) that models the mapping from text to the ASR encoder's latent space; the learned text-to-latent encoder is then used to fine-tune the decoder, and the original encoder is restored at inference, so the method adds zero runtime overhead. WhisTLE can additionally be combined with text-to-speech (TTS)-generated pseudo-speech for stronger adaptation. Across four out-of-domain datasets and four ASR architectures, WhisTLE with TTS reduces word error rate by 12.3% relative to a TTS-only baseline and outperforms existing methods in 27 of 32 evaluated scenarios.
📝 Abstract
Pretrained automatic speech recognition (ASR) models such as Whisper perform well but still need domain adaptation to handle unseen vocabulary and parlance. In many real-world settings, collecting speech data is impractical, necessitating text-only adaptation. We propose WhisTLE, a deeply supervised, text-only adaptation method for pretrained encoder-decoder ASR models. WhisTLE trains a variational autoencoder (VAE) to model encoder outputs from text and fine-tunes the decoder using the learned text-to-latent encoder, optionally combined with text-to-speech (TTS) adaptation. At inference, the original encoder is restored, incurring no extra runtime cost. Across four out-of-domain datasets and four ASR models, WhisTLE with TTS reduces word error rate (WER) by 12.3% relative to TTS-only adaptation and outperforms all non-WhisTLE baselines in 27 of 32 scenarios.
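The two-stage recipe described above (train a VAE mapping text to encoder latents, then fine-tune the decoder on those latents) can be sketched in PyTorch. Everything below is an illustrative assumption: the toy sizes, the token-aligned latent sequence, and the tiny networks stand in for the paper's actual architecture, which is not specified here.

```python
# Hypothetical sketch of the WhisTLE idea, NOT the paper's implementation.
# Assumptions: latents are token-aligned, toy dimensions, MSE reconstruction.
import torch
import torch.nn as nn

VOCAB, D_TEXT, D_LAT = 100, 32, 16   # toy sizes (assumed)

class TextToLatentVAE(nn.Module):
    """Maps token ids to a distribution over the ASR encoder's latent space."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_TEXT)
        self.mu = nn.Linear(D_TEXT, D_LAT)
        self.logvar = nn.Linear(D_TEXT, D_LAT)

    def forward(self, tokens):                     # tokens: (B, T)
        h = self.embed(tokens)                     # (B, T, D_TEXT)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample z while keeping gradients.
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        return z, mu, logvar

def vae_step(vae, tokens, encoder_out):
    """Stage 1: supervise sampled latents against real ASR encoder outputs."""
    z, mu, logvar = vae(tokens)
    recon = ((z - encoder_out) ** 2).mean()                      # reconstruction
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()   # KL to N(0, I)
    return recon + kl

# Stage 2 (not shown): feed z from the trained VAE into the ASR decoder in
# place of real encoder states and fine-tune the decoder on text-only data.
# At inference the original speech encoder is restored, so runtime is unchanged.

tokens = torch.randint(0, VOCAB, (2, 8))          # dummy text batch
fake_encoder_out = torch.randn(2, 8, D_LAT)       # stand-in encoder states
loss = vae_step(TextToLatentVAE(), tokens, fake_encoder_out)
```

The key design point the sketch illustrates is that only the decoder ever sees VAE-produced latents during adaptation; the speech encoder is untouched, which is why inference cost does not change.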