Closing the Speech-Text Gap with Limited Audio for Effective Domain Adaptation in LLM-Based ASR

📅 2026-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the modality mismatch between speech and text that arises when adapting large language models (LLMs) for automatic speech recognition using only textual data. To bridge the gap between the speech encoder and the LLM, the authors propose a hybrid batch training strategy that jointly leverages a small amount of target-domain speech data—less than four hours—and abundant text data. Remarkably, with only 10% of the target-domain speech data, the proposed approach significantly outperforms text-only adaptation on both in-domain and out-of-domain evaluations, achieving word error rates comparable to or even better than those obtained by conventional fine-tuning with full speech datasets. These results demonstrate the method’s high efficiency and practical applicability in low-resource domain adaptation scenarios.
📝 Abstract
Conventional end-to-end automatic speech recognition (ASR) systems rely on paired speech-text data for domain adaptation. Recent LLM-based ASR architectures connect a speech encoder to a large language model via a projection module, enabling adaptation with text-only data. However, this introduces a modality gap, as the LLM is not exposed to the noisy representations produced by the speech projector. We investigate whether small amounts of speech can mitigate this mismatch. We compare three strategies: text-only adaptation, paired speech-text adaptation, and mixed batching (MB), which combines both. Experiments in in-domain and out-of-domain settings show that even limited speech consistently improves performance. Notably, MB using only 10% of the target-domain speech (less than 4 hours) achieves word error rates comparable to, or better than, conventional ASR fine-tuning with the full dataset, indicating that small amounts of speech provide a strong modality-alignment signal.
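The mixed batching (MB) strategy described above combines scarce paired speech-text samples with abundant text-only samples in each training batch. The paper does not publish its exact batching recipe here, so the following is a minimal illustrative sketch: the function name, the mixing ratio, and the resampling-with-replacement choice for the small speech pool are assumptions, not the authors' implementation.

```python
import random

def make_mixed_batches(paired, text_only, batch_size, speech_frac, seed=0):
    """Build batches mixing paired speech-text samples with text-only
    samples, in the spirit of mixed batching (MB).

    paired      -- small list of (audio, transcript) examples
    text_only   -- large list of target-domain text examples
    speech_frac -- fraction of each batch drawn from the paired pool
                   (illustrative knob; not the paper's exact ratio)
    """
    rng = random.Random(seed)
    n_speech = max(1, round(batch_size * speech_frac))
    n_text = batch_size - n_speech
    batches = []
    # One pass over the abundant text data; the scarce paired speech
    # data is resampled with replacement so every batch contains some.
    for i in range(0, len(text_only) - n_text + 1, n_text):
        speech_part = rng.choices(paired, k=n_speech)
        text_part = text_only[i:i + n_text]
        batches.append(speech_part + text_part)
    return batches
```

With, say, 5 paired utterances and 40 text sentences at `speech_frac=0.25` and `batch_size=8`, every batch carries 2 speech-grounded examples alongside 6 text-only ones, which is how a <4 h speech pool can still touch every optimization step.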
Problem

Research questions and friction points this paper is trying to address.

modality gap
domain adaptation
LLM-based ASR
speech-text mismatch
limited audio
Innovation

Methods, ideas, or system contributions that make the work stand out.

modality gap
domain adaptation
mixed batching
LLM-based ASR
limited audio
👥 Authors
Thibault Bañeras-Roux
Idiap Research Institute, Switzerland
Sergio Burdisso
Researcher, Idiap Research Institute
artificial intelligence, machine learning, natural language processing
Esaú Villatoro-Tello
Idiap Research Institute, Switzerland
Dairazalia Sánchez-Cortés
Idiap Research Institute, Switzerland
Shiran Liu
Idiap Research Institute, Switzerland
Severin Baroudi
Idiap Research Institute, Switzerland; Laboratoire d’Informatique et des Systèmes, France
Shashi Kumar
PhD student @ Idiap Research Institute, Switzerland | EPFL, Switzerland
Automatic Speech Recognition, Multitask learning, Streaming ASR, LLM-ASR
Hasindri Watawana
Idiap Research Institute, Switzerland
Manjunath K E
Uniphore, USA & India
Kadri Hacioglu
Uniphore, USA & India
Petr Motlicek
Idiap Research Institute
Artificial intelligence, speech and signal processing, machine learning
Andreas Stolcke
Distinguished AI Scientist, Uniphore
Speech Processing