Text-only adaptation in LLM-based ASR through text denoising

📅 2026-01-28
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the performance degradation commonly observed in large language model (LLM)-based speech recognition systems when adapting solely with in-domain text, a process that often disrupts the alignment between speech and text modalities. To mitigate this issue, the authors propose a lightweight text-denoising adaptation approach that emulates the audio projection task by treating it as a text denoising problem. By training the LLM to reconstruct clean transcripts from noisy textual inputs, the method achieves effective domain adaptation without modifying the model architecture or introducing additional parameters. This strategy preserves cross-modal alignment while improving recognition accuracy, yielding up to a 22.1% relative reduction in word error rate on two benchmark datasets and outperforming recent state-of-the-art text-only adaptation techniques.
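As a reading aid for the "relative reduction" figure, the short Python sketch below shows how such a percentage is conventionally computed from a baseline and an adapted word error rate. The function name `relative_wer_reduction` and the example values are hypothetical and are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Both arguments are word error rates on the same scale (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - adapted_wer) / baseline_wer


# Hypothetical numbers, chosen only to illustrate the arithmetic:
# going from 20.0% WER to 15.0% WER is a 25% relative reduction.
example = relative_wer_reduction(20.0, 15.0)
```

A relative figure depends on the baseline: the same absolute gain of 5 points would be a smaller relative reduction against a weaker baseline.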

📝 Abstract
Adapting automatic speech recognition (ASR) systems based on large language models (LLMs) to new domains using text-only data is a significant yet underexplored challenge. Standard fine-tuning of the LLM on target-domain text often disrupts the critical alignment between speech and text modalities learned by the projector, degrading performance. We introduce a novel text-only adaptation method that emulates the audio projection task by treating it as a text denoising task. Our approach thus trains the LLM to recover clean transcripts from noisy inputs. This process effectively adapts the model to a target domain while preserving cross-modal alignment. Our solution is lightweight, requiring no architectural changes or additional parameters. Extensive evaluation on two datasets demonstrates up to 22.1% relative improvement, outperforming recent state-of-the-art text-only adaptation methods.
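The abstract does not say how the noisy inputs are produced, so the following Python sketch only illustrates the general shape of the training data: pairs of a corrupted transcript (input) and its clean version (target). The character-level corruption used here (random drops, substitutions, and duplications) is an assumption made for illustration, not the authors' noise model, and the names `add_noise` and `make_denoising_pairs` are hypothetical.

```python
import random
from typing import Dict, List, Optional

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def add_noise(text: str, p: float = 0.1, seed: Optional[int] = None) -> str:
    """Corrupt non-space characters with probability p (assumed noise model)."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch != " " and rng.random() < p:
            op = rng.choice(("drop", "swap", "dup"))
            if op == "drop":
                continue                          # delete the character
            if op == "swap":
                out.append(rng.choice(ALPHABET))  # replace with a random letter
            else:
                out.append(ch + ch)               # duplicate the character
        else:
            out.append(ch)                        # keep spaces and untouched characters
    return "".join(out)


def make_denoising_pairs(
    transcripts: List[str], p: float = 0.1, seed: int = 0
) -> List[Dict[str, str]]:
    """Build (noisy input, clean target) pairs from target-domain text."""
    rng = random.Random(seed)
    return [
        {"input": add_noise(t, p, rng.randrange(2**32)), "target": t}
        for t in transcripts
    ]


pairs = make_denoising_pairs(["book a flight to geneva"], p=0.3, seed=1)
```

In a setup like this, each `input` would be given to the LLM in place of the projected audio, and the model would be fine-tuned to generate `target`; how closely the noise should mimic real projector outputs is left open here.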
Problem

Research questions and friction points this paper is trying to address.

text-only adaptation
LLM-based ASR
domain adaptation
cross-modal alignment
automatic speech recognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

text-only adaptation
text denoising
LLM-based ASR
cross-modal alignment
domain adaptation
Authors

Sergio Gastón Burdisso
Idiap Research Institute
Esaú Villatoro-Tello
Idiap Research Institute
Andrés Carofilis
Idiap Research Institute
Shashi Kumar
PhD student@Idiap Research Institute, Switzerland | EPFL, Switzerland
Automatic Speech Recognition, Multitask learning, Streaming ASR, LLM-ASR
Kadri Hacioglu
Uniphore
S. Madikeri
University of Zurich
Pradeep Rangappa
Senior Speech Applied Scientist (Remote) @Omilia | Postdoc Idiap | Ex- Swiggy | PhD IIT Kharagpur
Speech Recognition, Machine Learning, Speaker Diarization
E. ManjunathK
Uniphore
Petr Motlíček
Idiap Research Institute, Brno University of Technology
Shankar Venkatesan
Uniphore
A. Stolcke
Uniphore