Seeing the Context: Rich Visual Context-Aware Speech Recognition via Multimodal Reasoning

📅 2026-03-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limitation of existing audio-visual speech recognition methods, which overly rely on lip movements while neglecting rich visual contextual cues such as scene content and on-screen text, often leading to unimodal dominance. To overcome this, we propose an Audio-Visual Chain-of-Thought (AV-CoT) mechanism that explicitly models cross-modal alignment between acoustic signals and diverse visual evidence, enabling context-aware multimodal reasoning. We further present the first systematic data pipeline and benchmark test set tailored for context-aware audio-visual speech recognition, both of which are publicly released. Experimental results demonstrate that our approach achieves state-of-the-art performance on relevant tasks, significantly mitigating reliance on a single modality. Code and datasets are made openly available to support reproducibility and future research.

📝 Abstract
Audio-visual speech recognition (AVSR) is an extension of ASR that incorporates visual signals. Current AVSR approaches primarily focus on lip motion, largely overlooking the rich context present in the video, such as the speaking scene and on-screen text. To tackle this Context-Aware AVSR (CAVSR) task, we propose VASR, a model designed to "see" and reason over the visual context to improve speech recognition. Specifically, we construct an Audio-Visual Chain-of-Thought (AV-CoT) that explicitly enforces intermediate cross-modal grounding between acoustic signals and visual evidence. This evidence-driven reasoning mitigates the "single-modality dominance" problem, where models either over-rely on the visual context or fail to utilize it. In addition, to address data scarcity, we construct and release a corresponding data pipeline and test set. Experiments show that AV-CoT effectively mitigates single-modality dominance, achieving state-of-the-art performance on CAVSR. The project is open-sourced.
Problem

Research questions and friction points this paper is trying to address.

audio-visual speech recognition
visual context
multimodal reasoning
single-modality dominance
context-aware
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Reasoning
Audio-Visual Chain-of-Thought
Context-Aware Speech Recognition
Cross-Modal Grounding
Single-Modality Dominance
Wenjie Tian
Northwestern Polytechnical University
speech generation
Mingchen Shao
Northwestern Polytechnical University, Xi’an, China
Bingshen Mu
Northwestern Polytechnical University
Speech Recognition, Speech Understanding
Xuelong Geng
School of Computer Science, Northwestern Polytechnical University
ASR, LLM, speech
Chengyou Wang
Northwestern Polytechnical University, Xi’an, China
Yujie Liao
Northwestern Polytechnical University, Xi’an, China
Zhixian Zhao
Northwestern Polytechnical University
Emotion Speech Recognition, Understanding and Generation
Ziyu Zhang
Northwestern Polytechnical University, Xi’an, China
Jingbin Hu
Northwestern Polytechnical University, Xi’an, China
Mengqi Wei
Northwestern Polytechnical University, Xi’an, China
Lei Xie
Northwestern Polytechnical University
speech processing, speech recognition, speech synthesis, multimedia, artificial intelligence