Enhancing Audiovisual Speech Recognition through Bifocal Preference Optimization

📅 2024-12-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address performance degradation in audiovisual automatic speech recognition (AV-ASR) caused by noisy audio, spontaneous speech, and the uncertain usefulness of visual information in unconstrained real-world videos, this paper proposes BPO-AVASR, a Bifocal Preference Optimization method. Rather than only fine-tuning an audio-only ASR model on audiovisual data with conventional ASR objectives, it simulates common AV-ASR errors from two focal points: the input side (manipulating the audio or visual input) and the output side (rewriting the transcript). The simulated errors form a bifocal preference dataset, and the model is then optimized on both input-side and output-side preferences. Evaluated on real-world video benchmarks across various domains, the method achieves an average 12.3% relative reduction in word error rate (WER) and outperforms previous state-of-the-art models.
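The relative WER reduction quoted above can be made concrete with a small worked example. The sketch below computes WER as word-level edit distance divided by reference length, then the relative reduction between two systems. The function names and the sample numbers are illustrative assumptions, not values or code from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Row-by-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Fraction of the baseline WER removed by the improved system."""
    return (baseline_wer - improved_wer) / baseline_wer


# Hypothetical numbers: a baseline at 20.00% WER and a system at 17.54% WER
# correspond to a relative reduction of about 12.3%.
example_reduction = relative_wer_reduction(0.2000, 0.1754)
```

A relative reduction is measured against the baseline error, so in this hypothetical case the absolute WER drops by only 2.46 points.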

📝 Abstract
Audiovisual Automatic Speech Recognition (AV-ASR) aims to improve speech recognition accuracy by leveraging visual signals. It is particularly challenging in unconstrained real-world scenarios across various domains due to noisy acoustic environments, spontaneous speech, and the uncertain use of visual information. Most previous works fine-tune audio-only ASR models on audiovisual datasets, optimizing them for conventional ASR objectives. However, they often neglect visual features and common errors in unconstrained video scenarios. In this paper, we propose using a preference optimization strategy to improve speech recognition accuracy for real-world videos. First, we create preference data via simulating common errors that occurred in AV-ASR from two focals: manipulating the audio or vision input and rewriting the output transcript. Second, we propose BPO-AVASR, a Bifocal Preference Optimization method to improve AV-ASR models by leveraging both input-side and output-side preference. Extensive experiments demonstrate that our approach significantly improves speech recognition accuracy across various domains, outperforming previous state-of-the-art models on real-world video speech recognition.
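The two focal points described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions: the `Example` and `PreferencePair` types, the span-masking and word-dropping corruptions, and the DPO-style loss are all hypothetical stand-ins, since the abstract does not specify the data format, the exact manipulations, or the training objective.

```python
import math
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    audio: List[float]   # toy stand-in for audio features
    frames: List[float]  # toy stand-in for visual features
    transcript: str


@dataclass
class PreferencePair:
    focal: str           # "input" or "output"
    chosen: Example
    rejected: Example


def mask_span(signal: List[float], start: int, length: int) -> List[float]:
    """Zero out a span of the signal (assumed corruption, for illustration)."""
    out = list(signal)
    for i in range(start, min(start + length, len(out))):
        out[i] = 0.0
    return out


def drop_word(transcript: str, index: int) -> str:
    """Simulate a deletion error by removing one word (assumed rewrite rule)."""
    words = transcript.split()
    del words[index]
    return " ".join(words)


def build_pairs(ex: Example, rng: random.Random) -> List[PreferencePair]:
    # Input side: same transcript, clean input preferred over corrupted input.
    start = rng.randrange(len(ex.audio))
    masked_audio = Example(mask_span(ex.audio, start, 2), ex.frames, ex.transcript)
    masked_vision = Example(ex.audio, [0.0] * len(ex.frames), ex.transcript)
    # Output side: same input, reference transcript preferred over a rewrite.
    index = rng.randrange(len(ex.transcript.split()))
    rewritten = Example(ex.audio, ex.frames, drop_word(ex.transcript, index))
    return [
        PreferencePair("input", ex, masked_audio),
        PreferencePair("input", ex, masked_vision),
        PreferencePair("output", ex, rewritten),
    ]


def dpo_style_loss(logp_chosen: float, logp_rejected: float,
                   ref_chosen: float, ref_rejected: float,
                   beta: float = 0.1) -> float:
    """-log(sigmoid(margin)); a DPO-style objective assumed for illustration."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))


pairs = build_pairs(
    Example([0.1, 0.2, 0.3, 0.4], [1.0, 1.0], "turn the knob slowly"),
    random.Random(0),
)
```

In this sketch each training example yields two input-side pairs and one output-side pair, and the loss shrinks as the policy assigns relatively more log-probability to the chosen side than the reference model does.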

Problem

Research questions and friction points this paper is trying to address.

Audio-Visual Speech Recognition
Noisy Environment
Video Instability

Innovation

Methods, ideas, or system contributions that make the work stand out.

BPO-AVASR
Bifocal Preference Optimization
Audio-Visual Speech Recognition