Low-Resource Domain Adaptation for Speech LLMs via Text-Only Fine-Tuning

📅 2025-06-06
🤖 AI Summary
To address the cross-domain adaptation challenge of speech large language models (Speech LLMs) in low-resource settings, where target-domain speech-text pairs are unavailable and only unlabeled text is accessible, we propose a text-only fine-tuning framework. The method introduces a real-time speech-text alignment evaluation mechanism: with the speech encoder's projection layer frozen, only the language-model parameters are optimized, while textual outputs are dynamically constrained to remain consistent with the implicit speech representations. This design mitigates catastrophic forgetting and preserves source-domain performance. Experiments on LibriSpeech (source domain) and the SlideSpeech and Medical datasets (target domains) show that the approach achieves significant WER reductions under zero-shot speech conditions, matching the target-domain performance of full speech-text fine-tuning while incurring less than 0.3% WER degradation on the source domain.
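The recipe summarized above (freeze the speech encoder and its projection layer, update only the LLM, and gate text-only updates with a real-time alignment check) can be sketched in plain Python. All names and the KL-based drift check below are hypothetical stand-ins; the paper's actual implementation and alignment metric are not shown here.

```python
import math

def trainable_parameter_names(all_param_names):
    """Text-only fine-tuning: the speech encoder and projection layer stay
    frozen; only language-model parameters receive gradient updates.
    Prefixes are illustrative, not the paper's actual module names."""
    frozen_prefixes = ("speech_encoder.", "projector.")
    return [n for n in all_param_names if not n.startswith(frozen_prefixes)]

def kl_divergence(p, q):
    """KL(p || q) over two discrete distributions (q must be nonzero
    wherever p is nonzero)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def alignment_ok(reference_dist, current_dist, tol=0.1):
    """Stand-in for the real-time alignment evaluation: accept a text-only
    update only if the LLM's output distribution has not drifted too far
    from the speech-conditioned reference."""
    return kl_divergence(reference_dist, current_dist) <= tol

params = [
    "speech_encoder.layer0.weight",
    "projector.weight",
    "llm.layer0.attn.weight",
    "llm.lm_head.weight",
]
print(trainable_parameter_names(params))
# Only the llm.* parameters remain trainable.
```

In a real training loop, `alignment_ok` would be evaluated on held-out speech-conditioned outputs during fine-tuning, and updates that break speech-text consistency would be rejected or down-weighted.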

📝 Abstract
Recent advances in automatic speech recognition (ASR) have combined speech encoders with large language models (LLMs) through projection, forming Speech LLMs with strong performance. However, adapting them to new domains remains challenging, especially in low-resource settings where paired speech-text data is scarce. We propose a text-only fine-tuning strategy for Speech LLMs using unpaired target-domain text without requiring additional audio. To preserve speech-text alignment, we introduce a real-time evaluation mechanism during fine-tuning. This enables effective domain adaptation while maintaining source-domain performance. Experiments on LibriSpeech, SlideSpeech, and Medical datasets show that our method achieves competitive recognition performance, with minimal degradation compared to full audio-text fine-tuning. It also improves generalization to new domains without catastrophic forgetting, highlighting the potential of text-only fine-tuning for low-resource domain adaptation of ASR.
Problem

Research questions and friction points this paper is trying to address.

Adapting Speech LLMs to new domains with scarce paired speech-text data
Maintaining speech-text alignment during text-only fine-tuning
Achieving competitive ASR performance in low-resource settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text-only fine-tuning for Speech LLMs
Real-time evaluation preserves speech-text alignment
Effective domain adaptation without paired speech-text
Yangui Fang
Huazhong University of Science and Technology
Speech LLM, ASR
Jing Peng
MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Shanghai Jiao Tong University; Jiangsu Key Lab of Language Computing, Suzhou, China
Xu Li
AISpeech Co., Ltd., Suzhou, China
Yu Xi
MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Shanghai Jiao Tong University; Jiangsu Key Lab of Language Computing, Suzhou, China
Chengwei Zhang
Huazhong University of Science and Technology, School of Electronic Information and Communications
Guohui Zhong
Huazhong University of Science and Technology, School of Electronic Information and Communications
Kai Yu
MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Shanghai Jiao Tong University; Jiangsu Key Lab of Language Computing, Suzhou, China