Analysis of Domain Shift across ASR Architectures via TTS-Enabled Separation of Target Domain and Acoustic Conditions

📅 2025-08-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work investigates the root causes of performance disparities in automatic speech recognition (ASR) models under domain shift. To separate the effect of the architectural paradigm (modular vs. end-to-end seq2seq) from that of individual design decisions, the study systematically varies key modeling choices, including output label units (e.g., phonemes, subwords), context length, and network topology. Target-domain speech is generated with text-to-speech (TTS) synthesis, which decouples the linguistic domain shift from acoustic variation. Experiments use LibriSpeech-pretrained models, with domain adaptation via target-domain n-gram and neural language models and without retraining the acoustic model. The results indicate that performance differences under domain shift stem primarily from specific modeling choices, not from the high-level architectural paradigm itself. Subword tokenization and long-context modeling are identified as critical factors under domain mismatch. The study thus offers interpretable, empirical guidance for designing robust ASR systems: concrete design decisions, rather than abstract architecture categories, govern cross-domain generalization.

📝 Abstract

We analyze automatic speech recognition (ASR) modeling choices under domain mismatch, comparing classic modular and novel sequence-to-sequence (seq2seq) architectures. Across the different ASR architectures, we examine a spectrum of modeling choices, including label units, context length, and topology. To isolate language domain effects from acoustic variation, we synthesize target domain audio using a text-to-speech system trained on LibriSpeech. We incorporate target domain n-gram and neural language models for domain adaptation without retraining the acoustic model. To our knowledge, this is the first controlled comparison of optimized ASR systems across state-of-the-art architectures under domain shift, offering insights into their generalization. The results show that, under domain shift, rather than the decoder architecture choice or the distinction between classic modular and novel seq2seq models, it is specific modeling choices that influence performance.
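Adapting to a target domain with a language model while leaving the acoustic model untouched is often done by combining the two scores log-linearly at decoding or rescoring time. The abstract does not say how the scores are combined in this work, so the following is only a minimal sketch of that general idea, assuming n-best rescoring. The function names `rescore_nbest` and `make_unigram_lm`, the `lm_weight` parameter, and the toy add-alpha unigram model (a stand-in for a real target-domain n-gram or neural LM) are all hypothetical and chosen for illustration.

```python
import math


def make_unigram_lm(corpus, alpha=1.0):
    """Build a toy add-alpha unigram LM from target-domain text.

    Hypothetical stand-in for a real target-domain n-gram or neural LM.
    Returns a function mapping a sentence to its log-probability.
    """
    counts = {}
    for sentence in corpus:
        for word in sentence.split():
            counts[word] = counts.get(word, 0) + 1
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot for unseen words

    def logprob(text):
        return sum(
            math.log((counts.get(word, 0) + alpha) / (total + alpha * vocab_size))
            for word in text.split()
        )

    return logprob


def rescore_nbest(nbest, lm_logprob, lm_weight=0.5):
    """Return the hypothesis text with the best combined score.

    nbest: list of (text, acoustic_logprob) pairs from a fixed acoustic model.
    lm_logprob: callable mapping text to a log-probability.
    lm_weight: assumed scale on the LM score (0.0 ignores the LM entirely).
    """
    def combined(item):
        text, acoustic_score = item
        return acoustic_score + lm_weight * lm_logprob(text)

    return max(nbest, key=combined)[0]


# Usage: the acoustic scores alone slightly prefer the out-of-domain word,
# while the target-domain LM shifts the decision to the in-domain word.
target_lm = make_unigram_lm(["the patient has a fever", "the patient needs rest"])
hypotheses = [("the patient has a fever", -10.2), ("the patience has a fever", -10.0)]
print(rescore_nbest(hypotheses, target_lm, lm_weight=0.0))
print(rescore_nbest(hypotheses, target_lm, lm_weight=1.0))
```

In this sketch only the language model sees target-domain text; the acoustic scores are treated as fixed inputs, which mirrors the "no acoustic model retraining" setting described above.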

Problem

Research questions and friction points this paper is trying to address.

Analyzing ASR performance under domain mismatch conditions

Comparing modular and seq2seq architectures for domain adaptation

Isolating language domain effects from acoustic variations

Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthesize target domain audio via TTS

Use n-gram and neural LMs for adaptation

Compare modular and seq2seq ASR architectures

🔎 Similar Papers

Personalized Speech Recognition for Children with Test-Time Adaptation