Dr. SHAP-AV: Decoding Relative Modality Contributions via Shapley Attribution in Audio-Visual Speech Recognition

📅 2026-03-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses how audio-visual speech recognition (AVSR) systems weigh contributions from the audio and visual modalities under noisy conditions, a question whose dynamics have remained unclear. The authors propose Dr. SHAP-AV, a framework that introduces Shapley values into AVSR for the first time, quantifying modality contributions along three dimensions: global importance, the generation process, and temporal alignment. Their analysis reveals that AVSR models remain significantly reliant on audio even at low signal-to-noise ratios (SNR), that modality weights are governed primarily by SNR and evolve dynamically during decoding, and that temporal alignment mechanisms remain robust in noisy environments. Extensive experiments across two benchmarks and six state-of-the-art models validate the effectiveness of Dr. SHAP-AV, offering a new tool for interpretability and diagnostic analysis in AVSR research.

📝 Abstract
Audio-Visual Speech Recognition (AVSR) leverages both acoustic and visual information for robust recognition under noise. However, how models balance these modalities remains unclear. We present Dr. SHAP-AV, a framework using Shapley values to analyze modality contributions in AVSR. Through experiments on six models across two benchmarks and varying SNR levels, we introduce three analyses: Global SHAP for overall modality balance, Generative SHAP for contribution dynamics during decoding, and Temporal Alignment SHAP for input-output correspondence. Our findings reveal that models shift toward visual reliance under noise yet maintain high audio contributions even under severe degradation. Modality balance evolves during generation, temporal alignment holds under noise, and SNR is the dominant factor driving modality weighting. These findings expose a persistent audio bias, motivating adaptive modality-weighting mechanisms and the adoption of Shapley-based attribution as a standard AVSR diagnostic.
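With only two modalities, the Shapley value has a closed form: each modality's contribution averages its marginal gain over the two possible coalition orderings. As a rough illustration of the kind of attribution the abstract describes (not the paper's actual implementation), the sketch below computes exact two-player Shapley values from a hypothetical `value` function mapping a modality subset to a model score, e.g. accuracy or negative WER at a given SNR:

```python
def shapley_two_modalities(value):
    """Exact Shapley values for the two-player game {audio, visual}.

    `value` maps a frozenset of modality names to a scalar model score
    (e.g. recognition accuracy). This interface is a hypothetical stand-in
    for however Dr. SHAP-AV actually scores masked-modality inputs.
    """
    a, v = "audio", "visual"
    empty = value(frozenset())
    both = value(frozenset({a, v}))
    # Average of marginal contributions over both join orders.
    phi_audio = 0.5 * ((value(frozenset({a})) - empty)
                       + (both - value(frozenset({v}))))
    phi_visual = 0.5 * ((value(frozenset({v})) - empty)
                        + (both - value(frozenset({a}))))
    return {a: phi_audio, v: phi_visual}

# Toy scores only: audio alone recovers most accuracy, visual adds a little.
scores = {
    frozenset(): 0.0,
    frozenset({"audio"}): 0.80,
    frozenset({"visual"}): 0.30,
    frozenset({"audio", "visual"}): 0.95,
}
contrib = shapley_two_modalities(scores.__getitem__)
# Efficiency property: contributions sum to the full-coalition score (0.95).
```

Repeating this computation per SNR level (or per decoding step, for the generative analysis) yields the kind of modality-weight curves the paper reports.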
Problem

Research questions and friction points this paper is trying to address.

Audio-Visual Speech Recognition
Modality Contribution
Shapley Values
Noise Robustness
Multimodal Learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Shapley Values
Audio-Visual Speech Recognition
Modality Contribution
Interpretability
Multimodal Learning