🤖 AI Summary
To address the limited robustness of automatic speech recognition (ASR) in recognizing novel utterances and correcting grammatical errors, this paper proposes an acoustic–linguistic co-modeling paradigm. Specifically, it uses an instruction-tuned large language model (LLM) as a zero-shot front-end linguistic feature extractor for the decoder, integrated directly into end-to-end ASR decoding without additional annotations or task-specific fine-tuning. Built on a joint CTC-attention architecture, the method prompts the LLM to perform zero-shot grammatical error correction and semantic rescoring of CTC-generated hypotheses; the resulting LLM representations are then fed into the decoder. Experiments on mainstream benchmarks show a relative reduction of approximately 13% in word error rate (WER), with substantial accuracy improvements for long sentences, noisy conditions, and syntactically complex utterances. This work establishes a novel paradigm for deep, zero-shot coupling of LLMs with ASR systems.
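As a reading aid, here is a minimal sketch of what a relative WER reduction means. The function name and the numbers are illustrative assumptions, not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the fraction by which new_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Made-up numbers for illustration only; NOT taken from the paper.
# A drop from 10.0% WER to 8.7% WER is 1.3 points absolute,
# which is roughly 13% relative to the 10.0% baseline.
reduction = relative_wer_reduction(10.0, 8.7)
print(f"{reduction:.1%}")
```

The distinction matters when comparing systems: the same relative gain corresponds to a smaller absolute change when the baseline WER is already low.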
📝 Abstract
We propose using an instruction-tuned large language model (LLM) to guide the text generation process in automatic speech recognition (ASR). Modern LLMs are adept at performing various text generation tasks through zero-shot learning when prompted with instructions designed for specific objectives. This paper explores the potential of LLMs to derive linguistic information that can facilitate text generation in end-to-end ASR models. Specifically, we instruct an LLM to correct grammatical errors in an ASR hypothesis and use the LLM-derived representations to refine the output further. The proposed model is built on the joint CTC and attention architecture, with the LLM serving as a front-end feature extractor for the decoder. The ASR hypothesis to be corrected is obtained from the encoder via CTC decoding and fed into the LLM along with a specific instruction. The decoder then takes the LLM output as input to perform token predictions, combining acoustic information from the encoder with the linguistic information provided by the LLM. Experimental results show that the proposed LLM-guided model achieves a relative reduction of approximately 13% in word error rate across major benchmarks.
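The first two steps of the pipeline described above (CTC decoding of the encoder output, then pairing the hypothesis with an instruction) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the vocabulary, the blank index, the instruction wording, and the prompt layout are all hypothetical, and greedy decoding is only one possible form of CTC decoding. The LLM call and the attention decoder are omitted.

```python
from itertools import groupby
from typing import List, Sequence

BLANK_ID = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(frame_ids: Sequence[int], blank_id: int = BLANK_ID) -> List[int]:
    """Collapse consecutive repeated frame labels, then drop blanks."""
    return [label for label, _ in groupby(frame_ids) if label != blank_id]


def build_prompt(hypothesis: str, instruction: str) -> str:
    """Join an instruction and an ASR hypothesis into one LLM input (assumed layout)."""
    return f"{instruction}\n\nInput: {hypothesis}\nOutput:"


# Hypothetical toy vocabulary and instruction; the paper's actual prompt is not given here.
VOCAB = {1: "i", 2: "has", 3: "a", 4: "cat"}
INSTRUCTION = "Correct any grammatical errors in the following speech recognition hypothesis."

# Per-frame argmax labels as a CTC encoder head might emit them (made-up values).
frames = [0, 1, 1, 0, 2, 2, 2, 0, 0, 3, 4, 4, 0]

token_ids = ctc_greedy_decode(frames)
hypothesis = " ".join(VOCAB[t] for t in token_ids)
prompt = build_prompt(hypothesis, INSTRUCTION)
print(prompt)
```

In the model described in the abstract, the LLM's output for such a prompt serves as input to the decoder, which combines it with the encoder's acoustic information to predict tokens.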