Speech LLMs are Contextual Reasoning Transcribers

📅 2026-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes Chain-of-Thought ASR (CoT-ASR), a novel approach that leverages the contextual reasoning capabilities of large language models (LLMs) for automatic speech recognition (ASR), moving beyond conventional LLM-based methods that treat ASR as a simple speech-to-text mapping. CoT-ASR introduces chain-of-thought reasoning into ASR by first prompting the LLM to analyze the contextual content of the input speech and then generating a more accurate transcription based on this analysis, while also supporting user-guided transcription modes. To effectively align speech and textual representations, the method employs a CTC-guided modality adapter that weights LLM embeddings using the probabilities of non-blank tokens from Connectionist Temporal Classification (CTC). Experimental results demonstrate that CoT-ASR achieves a relative reduction of 8.7% in word error rate (WER) and 16.9% in entity error rate (EER) compared to standard LLM-based ASR systems.
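The summary describes a two-step generation (contextual analysis first, then transcription) completed in a single pass, plus a user-guided mode. The paper's actual prompt format is not given here, so the template below is purely hypothetical; the `<|speech|>` markers, wording, and `build_user_guided_prompt` helper are all assumptions for illustration.

```python
# Hypothetical single-pass CoT-ASR prompt (format assumed, not from the
# paper): the LLM first emits a contextual analysis of the speech, then
# the transcription, in one autoregressive generation.
COT_ASR_PROMPT = (
    "<|speech|>{speech_embeddings}<|/speech|>\n"
    "First analyze the topic, named entities, and likely domain of the "
    "speech, then transcribe it.\n"
    "Analysis:"
)

def build_user_guided_prompt(context: str) -> str:
    """User-guided mode (assumed format): user-provided context, e.g. an
    entity list, replaces the self-generated analysis before transcription."""
    return (
        "<|speech|>{speech_embeddings}<|/speech|>\n"
        f"Context: {context}\n"
        "Transcription:"
    )
```

In the self-guided mode, decoding simply continues past "Analysis:" into the transcription; in the user-guided mode, the model skips reasoning and transcribes conditioned on the injected context.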
📝 Abstract
Despite recent extensions of large language models (LLMs) to speech inputs, effectively leveraging their rich knowledge and contextual understanding in automatic speech recognition (ASR) remains non-trivial, as the task primarily involves direct speech-to-text mapping. To address this, the paper proposes chain-of-thought ASR (CoT-ASR), which constructs a reasoning chain that enables LLMs to first analyze the input speech and generate contextual analysis, thereby fully exploiting their generative capabilities. With this contextual reasoning, CoT-ASR then performs more informed speech recognition, completing both reasoning and transcription in a single pass. Moreover, CoT-ASR naturally supports user-guided transcription: while designed to self-generate reasoning, it can also seamlessly incorporate user-provided context to guide transcription, further extending ASR functionality. To reduce the modality gap, the paper introduces a CTC-guided Modality Adapter, which uses CTC non-blank token probabilities to weight LLM embeddings, efficiently aligning speech encoder outputs with the LLM's textual latent space. Experiments show that, compared to standard LLM-based ASR, CoT-ASR achieves a relative reduction of 8.7% in word error rate (WER) and 16.9% in entity error rate (EER).
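One plausible reading of the CTC-guided Modality Adapter is sketched below: per-frame CTC posteriors over the LLM vocabulary are used to form a weighted sum of LLM token embeddings, with blank mass removed and frames that are mostly blank downscaled. This is a minimal numpy sketch under stated assumptions; the function name, the renormalization, and the blank-based scaling are guesses, not the paper's implementation.

```python
import numpy as np

def ctc_guided_adapter(ctc_logits, llm_embed, blank_id=0):
    """Map speech frames into the LLM's textual latent space (sketch).

    ctc_logits: (T, V) per-frame CTC logits over the LLM vocabulary,
                where index `blank_id` is the CTC blank.
    llm_embed:  (V, d_llm) LLM token embedding table (blank row unused).
    Returns:    (T, d_llm) soft LLM embeddings, one per speech frame.
    """
    # Numerically stable softmax over the vocabulary, per frame.
    z = ctc_logits - ctc_logits.max(axis=-1, keepdims=True)
    probs = np.exp(z)
    probs /= probs.sum(axis=-1, keepdims=True)

    # Drop blank mass and renormalize over non-blank tokens.
    nonblank = probs.copy()
    nonblank[:, blank_id] = 0.0
    nonblank /= np.maximum(nonblank.sum(axis=-1, keepdims=True), 1e-8)

    # Expected LLM embedding per frame: probability-weighted sum of
    # token embeddings (the "weighting of LLM embeddings" step).
    soft_embeds = nonblank @ llm_embed  # (T, d_llm)

    # Frames dominated by blank carry little content; scale them down
    # by their total non-blank probability (assumed design choice).
    content = (1.0 - probs[:, blank_id])[:, None]
    return soft_embeds * content
```

The appeal of such an adapter is that the speech-side representation is expressed directly in the LLM's own embedding basis, so no large projection network has to learn the modality alignment from scratch.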
Problem

Research questions and friction points this paper is trying to address.

speech recognition
large language models
contextual reasoning
automatic speech recognition
modality gap
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Thought ASR
Contextual Reasoning
Modality Adapter
CTC-guided Alignment
User-Guided Transcription