🤖 AI Summary
This work addresses the insufficient robustness of multimodal error correction in audio-visual speech recognition (AVSR). We propose DualHyp, a novel framework that first generates N-best hypotheses independently from the audio and visual modalities. A large language model (LLM) then operates in the linguistic space, using modality-grounded prompting and RelPrompt—a noise-aware guidance mechanism—to assess the temporal reliability of each modality and adaptively fuse the cross-modal hypotheses for generative error correction. By decoupling the modalities during correction, DualHyp avoids the modality-coupling bias inherent in conventional single-stream approaches. On the LRS2 benchmark, it reduces the error rate by up to 57.7% over a standard ASR baseline, whereas single-stream generative error correction methods achieve only about a 10% gain. To our knowledge, DualHyp is the first LLM-driven, reliability-aware, dual-modal generative error correction framework for AVSR.
📝 Abstract
This paper introduces a new paradigm for generative error correction (GER) in audio-visual speech recognition (AVSR) that reasons over modality-specific evidence directly in the language space. Our framework, DualHyp, empowers a large language model (LLM) to compose independent N-best hypotheses from separate automatic speech recognition (ASR) and visual speech recognition (VSR) models. To maximize the effectiveness of DualHyp, we further introduce RelPrompt, a noise-aware guidance mechanism that provides modality-grounded prompts to the LLM. RelPrompt conveys the temporal reliability of each modality stream, guiding the model to dynamically switch its focus between ASR and VSR hypotheses for accurate correction. Under various corruption scenarios, our framework attains up to a 57.7% error-rate gain on the LRS2 benchmark over a standard ASR baseline, whereas single-stream GER approaches achieve only a 10% gain. To facilitate research within the DualHyp framework, we release the code and a dataset comprising ASR and VSR hypotheses at https://github.com/sungnyun/dualhyp.
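To make the prompting idea concrete, the following is a minimal, hypothetical sketch of how dual-modality N-best hypotheses and per-segment reliability cues could be assembled into a single LLM prompt. The function name, prompt wording, and reliability format are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: composing a RelPrompt-style prompt that presents
# ASR and VSR N-best hypotheses alongside temporal reliability cues.
# All names and the prompt layout are assumptions for illustration only.

def build_dualhyp_prompt(asr_nbest, vsr_nbest, reliability):
    """Compose a modality-grounded correction prompt.

    asr_nbest / vsr_nbest: lists of N-best hypothesis strings per modality.
    reliability: list of (modality, score) pairs over time segments,
        e.g. [("audio", 0.9), ("visual", 0.4)], hinting which stream
        the LLM should trust within each segment.
    """
    lines = ["Correct the transcript using both modality hypotheses."]
    lines.append("ASR hypotheses:")
    lines += [f"  A{i + 1}: {h}" for i, h in enumerate(asr_nbest)]
    lines.append("VSR hypotheses:")
    lines += [f"  V{i + 1}: {h}" for i, h in enumerate(vsr_nbest)]
    lines.append("Per-segment reliability (higher = more trustworthy):")
    lines += [
        f"  segment {t}: {modality}={score:.1f}"
        for t, (modality, score) in enumerate(reliability)
    ]
    return "\n".join(lines)
```

In this sketch, the reliability lines play the role of the noise-aware guidance: when the audio segment is noisy its score drops, signaling the LLM to lean on the VSR hypotheses for that span, and vice versa.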