🤖 AI Summary
This work proposes Chain-of-Thought ASR (CoT-ASR), a novel approach that leverages the contextual reasoning capabilities of large language models (LLMs) for automatic speech recognition (ASR), moving beyond conventional LLM-based methods that treat ASR as a simple speech-to-text mapping. CoT-ASR introduces chain-of-thought reasoning into ASR by first prompting the LLM to analyze the contextual content of the input speech and then generating a more accurate transcription based on this analysis, while also supporting user-guided transcription modes. To effectively align speech and textual representations, the method employs a CTC-guided modality adapter that weights LLM embeddings using the probabilities of non-blank tokens from Connectionist Temporal Classification (CTC). Experimental results demonstrate that CoT-ASR achieves a relative reduction of 8.7% in word error rate (WER) and 16.9% in entity error rate (EER) compared to standard LLM-based ASR systems.
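The two-step, single-pass workflow described above (self-generated analysis, then transcription, with an optional user-guided mode) can be sketched as follows. The instruction wording and the `<analysis>`/`<transcript>` output tags are illustrative assumptions for this sketch, not the paper's actual prompt or output format.

```python
import re

ANALYSIS_TAG = "analysis"      # hypothetical tags chosen for this sketch;
TRANSCRIPT_TAG = "transcript"  # the paper's real output format may differ

def build_instruction(user_context=None):
    """Build the text instruction that accompanies the speech embeddings.

    Without user context, the model is asked to produce its own contextual
    analysis first (chain-of-thought mode); with user context, that context
    is injected instead, giving the user-guided transcription mode.
    """
    if user_context is None:
        return (f"First analyze the topic and likely entities of the speech "
                f"inside <{ANALYSIS_TAG}>...</{ANALYSIS_TAG}>, then write the "
                f"transcription inside <{TRANSCRIPT_TAG}>...</{TRANSCRIPT_TAG}>.")
    return (f"Context provided by the user: {user_context}\n"
            f"Using this context, write the transcription inside "
            f"<{TRANSCRIPT_TAG}>...</{TRANSCRIPT_TAG}>.")

def parse_single_pass_output(generated):
    """Split one generated sequence into (analysis, transcription)."""
    def grab(tag):
        m = re.search(rf"<{tag}>(.*?)</{tag}>", generated, re.DOTALL)
        return m.group(1).strip() if m else ""
    return grab(ANALYSIS_TAG), grab(TRANSCRIPT_TAG)
```

Because both the analysis and the transcription come out of one generated sequence, no second decoding pass is needed; the parser simply separates the two spans afterward.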
📝 Abstract
Although LLMs have been extended to accept speech inputs, effectively leveraging their rich knowledge and contextual understanding in automatic speech recognition (ASR) remains non-trivial, as the task is typically framed as direct speech-to-text mapping. To address this limitation, this paper proposes chain-of-thought ASR (CoT-ASR), which constructs a reasoning chain that enables LLMs to first analyze the input speech and generate a contextual analysis, thereby fully exploiting their generative capabilities. With this contextual reasoning, CoT-ASR then performs more informed speech recognition, completing both reasoning and transcription in a single pass. Moreover, CoT-ASR naturally supports user-guided transcription: while designed to self-generate reasoning, it can also seamlessly incorporate user-provided context to guide transcription, further extending ASR functionality. To reduce the modality gap, this paper introduces a CTC-guided Modality Adapter, which uses CTC non-blank token probabilities to weight LLM embeddings, efficiently aligning speech encoder outputs with the LLM's textual latent space. Experiments show that, compared to standard LLM-based ASR, CoT-ASR achieves a relative reduction of 8.7% in word error rate (WER) and 16.9% in entity error rate (EER).
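The CTC-guided Modality Adapter can be illustrated with a minimal sketch. One plausible reading of "using CTC non-blank token probabilities to weight LLM embeddings" is that each frame's CTC posterior, with the blank removed and renormalized, mixes the rows of the LLM's token-embedding table, projecting acoustic frames into the LLM's textual space; the paper's exact formulation may differ, and this plain-Python version ignores batching and learned parameters.

```python
import math

def ctc_guided_adapter(ctc_logits, llm_embed, blank_id=0):
    """Map encoder frames into the LLM's textual embedding space.

    ctc_logits: T x V list of per-frame CTC logits over the vocabulary
    llm_embed:  V x D list of LLM token-embedding rows (blank row unused)
    returns:    T x D list of text-space frame representations, each a
                convex combination of non-blank token embeddings
    """
    out = []
    for frame in ctc_logits:
        m = max(frame)                                   # numerically stable
        exps = [math.exp(x - m) for x in frame]          # softmax
        z = sum(exps)
        probs = [e / z for e in exps]
        nb = sum(p for i, p in enumerate(probs) if i != blank_id)
        weights = [0.0 if i == blank_id else p / nb      # drop blank and
                   for i, p in enumerate(probs)]         # renormalize
        dim = len(llm_embed[0])
        out.append([sum(w * llm_embed[v][d] for v, w in enumerate(weights))
                    for d in range(dim)])
    return out
```

Dropping the blank before mixing keeps frames that CTC considers mostly silence from being pulled toward an uninformative embedding, which is consistent with the stated goal of aligning speech encoder outputs with the LLM's textual latent space.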