🤖 AI Summary
To address ASR transcription errors in complex scenarios such as TV dramas, which arise from overlapping speech, domain-specific terminology, and long-range contextual dependencies, this paper proposes a video-guided two-stage post-correction framework: it first generates an initial ASR transcript, then performs fine-grained error correction using temporal and semantic cues from the synchronized video. Methodologically, it introduces the first video-driven ASR post-correction paradigm, combining prompt-based visual understanding from a Video-Large Multimodal Model (VLMM) with the contextual reasoning of a Large Language Model (LLM) to localize and correct errors across modalities. By jointly modeling multimodal context and extracting task-relevant visual information through prompt engineering, the framework achieves significant accuracy improvements on a multimodal TV-drama ASR benchmark, mitigating errors from overlapping speech, misrecognized specialized terms, and weak modeling of long-distance contextual dependencies.
📝 Abstract
Automatic Speech Recognition (ASR) has achieved remarkable success with deep learning, driving advancements in conversational artificial intelligence, media transcription, and assistive technologies. However, ASR systems still struggle in complex environments such as TV series, where overlapping speech, domain-specific terminology, and long-range contextual dependencies pose significant challenges to transcription accuracy. Existing multimodal approaches fail to exploit the rich temporal and contextual information available in video when correcting ASR outputs. To address this limitation, we propose a novel multimodal post-correction framework that refines ASR transcriptions by leveraging contextual cues extracted from video. Our framework consists of two stages: ASR Generation and Video-based Post-Correction, where the first stage produces the initial transcript and the second stage corrects errors using Video-based Contextual Information Extraction and Context-aware ASR Correction. We employ the Video-Large Multimodal Model (VLMM) to extract key contextual information using tailored prompts, which is then integrated with a Large Language Model (LLM) to refine the ASR output. We evaluate our method on a multimodal benchmark for TV series ASR and demonstrate its effectiveness in improving ASR performance by leveraging video-based context to enhance transcription accuracy in complex multimedia environments.
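The two-stage pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `asr_generate`, `vlmm_extract_context`, and `llm_correct` are hypothetical stubs standing in for real ASR, VLMM, and LLM calls, and the example transcript and prompt are invented for demonstration.

```python
def asr_generate(audio_path: str) -> str:
    """Stage 1 (ASR Generation): produce the initial, possibly erroneous
    transcript. A real system would call an ASR model here."""
    return "He prescribed a course of anti-bionics."  # hypothetical ASR error


def vlmm_extract_context(video_path: str, prompt: str) -> str:
    """Stage 2a (Video-based Contextual Information Extraction): query the
    VLMM with a tailored prompt for scene-level cues (speakers, setting,
    on-screen or domain terminology). Stubbed with a fixed answer."""
    return "Hospital scene: a doctor discusses antibiotics with a patient."


def llm_correct(transcript: str, context: str) -> str:
    """Stage 2b (Context-aware ASR Correction): the LLM refines the
    transcript given the extracted visual context. Stubbed with a simple
    substitution for illustration."""
    if "antibiotics" in context:
        transcript = transcript.replace("anti-bionics", "antibiotics")
    return transcript


def post_correct(audio_path: str, video_path: str) -> str:
    """Full two-stage pipeline: ASR Generation, then Video-based
    Post-Correction driven by a tailored VLMM prompt."""
    transcript = asr_generate(audio_path)
    context = vlmm_extract_context(
        video_path,
        prompt="List the speakers, setting, and any domain-specific terms.",
    )
    return llm_correct(transcript, context)


print(post_correct("episode1.wav", "episode1.mp4"))
```

The key design point mirrored here is that the VLMM is not asked to transcribe speech; it only supplies contextual evidence, which the LLM then uses to repair the ASR hypothesis.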