🤖 AI Summary
To address ASR transcription errors in complex scenarios such as TV dramas, which arise from overlapping speech, domain-specific terminology, and long-range contextual dependencies, this paper proposes a video-guided two-stage post-correction framework: it first generates an initial ASR transcript, then performs fine-grained error correction using temporal and semantic cues from the synchronized video. Methodologically, it introduces the first video-driven ASR post-correction paradigm, combining prompt-based visual understanding from a Video-Large Multimodal Model (VLMM) with the contextual reasoning of a Large Language Model (LLM) to localize and correct errors across modalities. By jointly modeling multimodal context and extracting task-relevant visual information through prompt engineering, the framework achieves significant accuracy improvements on a multimodal TV-drama ASR benchmark, mitigating errors from overlapping speech, misrecognized specialized terms, and weak modeling of long-distance contextual dependencies.
📝 Abstract
Automatic Speech Recognition (ASR) has achieved remarkable success with deep learning, driving advancements in conversational artificial intelligence, media transcription, and assistive technologies. However, ASR systems still struggle in complex environments such as TV series, where overlapping speech, domain-specific terminology, and long-range contextual dependencies pose significant challenges to transcription accuracy. Existing multimodal approaches fail to exploit the rich temporal and contextual information available in video when correcting ASR outputs. To address this limitation, we propose a novel multimodal post-correction framework that refines ASR transcriptions by leveraging contextual cues extracted from video. Our framework consists of two stages: ASR Generation and Video-based Post-Correction, where the first stage produces the initial transcript and the second stage corrects errors using Video-based Contextual Information Extraction and Context-aware ASR Correction. We employ the Video-Large Multimodal Model (VLMM) to extract key contextual information using tailored prompts, which is then integrated with a Large Language Model (LLM) to refine the ASR output. We evaluate our method on a multimodal benchmark for TV series ASR and demonstrate its effectiveness in improving ASR performance by leveraging video-based context to enhance transcription accuracy in complex multimedia environments.
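The two-stage pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `asr_generate`, `vlmm_extract_context`, and `llm_correct` are hypothetical stubs standing in for real ASR, VLMM, and LLM calls, and the example transcript and prompt are invented for demonstration.

```python
def asr_generate(audio_path: str) -> str:
    """Stage 1 (ASR Generation): produce the initial, possibly erroneous
    transcript. A real system would call an ASR model here."""
    return "He prescribed a course of anti-bionics."  # hypothetical ASR error


def vlmm_extract_context(video_path: str, prompt: str) -> str:
    """Stage 2a (Video-based Contextual Information Extraction): query the
    VLMM with a tailored prompt for scene-level cues (speakers, setting,
    on-screen or domain terminology). Stubbed with a fixed answer."""
    return "Hospital scene: a doctor discusses antibiotics with a patient."


def llm_correct(transcript: str, context: str) -> str:
    """Stage 2b (Context-aware ASR Correction): the LLM refines the
    transcript given the extracted visual context. Stubbed with a simple
    substitution for illustration."""
    if "antibiotics" in context:
        transcript = transcript.replace("anti-bionics", "antibiotics")
    return transcript


def post_correct(audio_path: str, video_path: str) -> str:
    """Full two-stage pipeline: ASR Generation, then Video-based
    Post-Correction driven by a tailored VLMM prompt."""
    transcript = asr_generate(audio_path)
    context = vlmm_extract_context(
        video_path,
        prompt="List the speakers, setting, and any domain-specific terms.",
    )
    return llm_correct(transcript, context)


print(post_correct("episode1.wav", "episode1.mp4"))
```

The key design point mirrored here is that the VLMM is not asked to transcribe speech; it only supplies contextual evidence, which the LLM then uses to repair the ASR hypothesis.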