🤖 AI Summary
This work investigates why automatic speech recognition (ASR) models differ in performance under domain shift. To separate the effect of the architectural paradigm (modular vs. end-to-end seq2seq) from other factors, the study systematically controls key modeling variables, including output tokenization (e.g., phonemes, subwords), context length, and network topology. Target-domain speech is generated with text-to-speech (TTS) synthesis, which decouples linguistic domain shift from acoustic domain shift. Experiments use LibriSpeech-pretrained models, with domain adaptation via n-gram and neural language models. The results indicate that performance differences stem primarily from specific modeling choices, not from the high-level architectural paradigm itself. Subword tokenization and long-context modeling are identified as critical factors under domain mismatch. The study offers interpretable, empirical guidance for designing robust ASR systems.
📝 Abstract
We analyze automatic speech recognition (ASR) modeling choices under domain mismatch, comparing classic modular and novel sequence-to-sequence (seq2seq) architectures. Across the different ASR architectures, we examine a spectrum of modeling choices, including label units, context length, and topology. To isolate language domain effects from acoustic variation, we synthesize target domain audio using a text-to-speech system trained on LibriSpeech. We incorporate target domain n-gram and neural language models for domain adaptation without retraining the acoustic model. To our knowledge, this is the first controlled comparison of optimized ASR systems across state-of-the-art architectures under domain shift, offering insights into their generalization. The results show that, under domain shift, performance is driven by specific modeling choices rather than by the choice of decoder architecture or the distinction between classic modular and novel seq2seq models.
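
The abstract does not say how the target domain language models are combined with the acoustic model. One common option is to rescore an n-best list with a log-linear combination of acoustic and language model scores, which leaves the acoustic model untouched. The sketch below illustrates that idea under several assumptions: the add-one-smoothed bigram model, the function names `train_bigram_lm` and `rescore`, the `lm_weight` value, and the toy hypotheses and scores are all hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Return a function that scores a sentence with an add-one-smoothed
    bigram model estimated from target-domain text (toy stand-in for a
    real n-gram or neural LM)."""
    unigrams, bigrams = Counter(), Counter()
    vocab = {"</s>"}
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        vocab.update(tokens[1:])
        for prev, cur in zip(tokens, tokens[1:]):
            unigrams[prev] += 1
            bigrams[(prev, cur)] += 1
    v = len(vocab) + 1  # one extra slot reserves mass for unseen words

    def logprob(sentence):
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        return sum(
            math.log((bigrams[(p, c)] + 1) / (unigrams[p] + v))
            for p, c in zip(tokens, tokens[1:])
        )

    return logprob


def rescore(nbest, lm_logprob, lm_weight=0.5):
    """Pick the hypothesis maximizing am_score + lm_weight * lm_logprob.

    nbest: list of (text, acoustic_model_log_score) pairs (hypothetical values).
    """
    return max(nbest, key=lambda h: h[1] + lm_weight * lm_logprob(h[0]))[0]


# Hypothetical target-domain text and n-best list.
target_text = ["the patient has a fever", "the patient needs rest"]
lm = train_bigram_lm(target_text)
nbest = [
    ("the patient has a fever", -10.2),
    ("the patient has a fevor", -10.0),  # slightly preferred by the acoustic model
]
best = rescore(nbest, lm)
print(best)
```

With `lm_weight=0.0` the acoustic score alone decides and the misspelled hypothesis wins; with the target domain model included, the in-domain wording is preferred. In practice the weight would be tuned on held-out target domain data.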