Refining Pseudo-Audio Prompts with Speech-Text Alignment for Text-Only Domain Adaptation in LLM-Based ASR

📅 2026-05-14
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses domain adaptation for LLM-based speech recognition, where scarce paired speech-text data in a new domain limits performance and existing text-only adaptation methods either neglect acoustic context or produce inexpressive prompts. The authors propose a framework that explicitly models speech-text alignment to generate expressive pseudo-audio prompts from target-domain text alone. This bridges the modality gap without requiring real speech data. Experiments show that the method outperforms existing text-only baselines, lowering overall error rates and improving coverage of out-of-vocabulary terms.
📝 Abstract
LLM-based automatic speech recognition models demonstrate strong performance by connecting audio encoders and LLMs. However, the scarcity of paired speech and transcription data often hinders their adaptation to new domains, making text-only domain adaptation crucial. Existing methods typically rely on either fine-tuning the LLM alone or employing pseudo-audio prompts. The former neglects essential acoustic context, while the latter either suffers from limited scalability in data-scarce conditions or yields inexpressive prompts by leveraging only textual features and ignoring the audio modality. To address this, we propose an enhanced framework that explicitly models speech-text alignment. Our method efficiently generates highly expressive pseudo-audio prompts that bridge the modality gap, enabling effective target-domain adaptation. Experiments demonstrate that our approach outperforms existing text-only methods, improving both overall error rates and out-of-vocabulary coverage.
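The abstract reports two evaluation axes, overall error rate and out-of-vocabulary coverage, without defining them. The sketch below shows one common way such metrics are computed; it is not taken from the paper. `word_error_rate` is the standard word-level edit distance normalized by reference length, while `oov_recall` and its `source_vocab` argument are hypothetical names for an assumed definition of OOV coverage (the share of reference words absent from the source-domain vocabulary that the hypothesis recovers).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # At the start of row i, prev[j] is the distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def oov_recall(reference: str, hypothesis: str, source_vocab: set[str]) -> float:
    """Assumed definition: fraction of reference OOV word types found in the hypothesis."""
    oov = {w for w in reference.split() if w not in source_vocab}
    if not oov:
        return 1.0  # nothing to recover
    return len(oov & set(hypothesis.split())) / len(oov)
```

For the reference "the patient has tachycardia" and the hypothesis "the patient has tacky cardia", the edit distance is 2 (one substitution, one insertion) over 4 reference words, so the WER is 0.5. With a source vocabulary of {"the", "patient", "has"}, the only OOV word is missed and the recall is 0.0.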
Problem

Research questions and friction points this paper is trying to address.

domain adaptation
automatic speech recognition
pseudo-audio prompts
speech-text alignment
text-only adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

speech-text alignment
pseudo-audio prompts
text-only domain adaptation
LLM-based ASR
modality gap