🤖 AI Summary
This work addresses the performance degradation commonly observed in large language model (LLM)-based speech recognition systems when they are adapted with in-domain text alone, a process that often disrupts the alignment between the speech and text modalities. To mitigate this, the authors propose a lightweight text-denoising adaptation approach that reformulates the audio projection task as a text denoising problem. By training the LLM to reconstruct clean transcripts from noisy textual inputs, the method achieves effective domain adaptation without modifying the model architecture or introducing additional parameters. This strategy preserves cross-modal alignment while substantially improving recognition accuracy, yielding up to a 22.1% relative reduction in word error rate on two benchmark datasets and outperforming current state-of-the-art text-only adaptation techniques.
📝 Abstract
Adapting automatic speech recognition (ASR) systems based on large language models (LLMs) to new domains using text-only data is a significant yet underexplored challenge. Standard fine-tuning of the LLM on target-domain text often disrupts the critical alignment between speech and text modalities learned by the projector, degrading performance. We introduce a novel text-only adaptation method that emulates the audio projection task by recasting it as text denoising: the LLM is trained to recover clean transcripts from noisy inputs, which adapts it to the target domain while preserving cross-modal alignment. Our solution is lightweight, requiring no architectural changes or additional parameters. Extensive evaluation on two datasets demonstrates up to a 22.1% relative reduction in word error rate, outperforming recent state-of-the-art text-only adaptation methods.
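The denoising formulation above can be sketched at the data level. The snippet below is a minimal, illustrative sketch only: the word-level noise model (random substitutions, deletions, insertions), the probability values, and the pairing format are assumptions for illustration, not the paper's actual recipe. The resulting (noisy input, clean target) pairs would then be used for standard fine-tuning of the LLM.

```python
import random


def corrupt_transcript(words, sub_vocab, p_sub=0.1, p_del=0.05, p_ins=0.05, rng=None):
    """Inject ASR-like errors into a word sequence: random word
    substitutions, deletions, and insertions drawn from sub_vocab.
    (Hypothetical noise model, not the paper's.)"""
    rng = rng or random.Random()
    noisy = []
    for w in words:
        r = rng.random()
        if r < p_del:
            continue  # simulate a deletion error
        if r < p_del + p_sub:
            noisy.append(rng.choice(sub_vocab))  # simulate a substitution error
        else:
            noisy.append(w)  # keep the word unchanged
        if rng.random() < p_ins:
            noisy.append(rng.choice(sub_vocab))  # simulate an insertion error
    return noisy


def make_denoising_example(transcript, sub_vocab, rng=None, **noise_kwargs):
    """Build one (noisy input, clean target) pair for text-denoising
    adaptation: the LLM learns to map the noisy text back to the clean
    transcript, mimicking what it sees from the audio projector."""
    noisy = corrupt_transcript(transcript.split(), sub_vocab, rng=rng, **noise_kwargs)
    return {"input": " ".join(noisy), "target": transcript}
```

Because the clean target is always the original in-domain text, this objective exposes the LLM to target-domain language while keeping its input distribution close to the noisy, projector-like inputs it sees at inference time.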