🤖 AI Summary
Existing audio-visual question answering (AVQA) methods overfit to dataset biases and therefore generalize poorly, and the field lacks a standardized benchmark for diagnosing this weakness. To address this, we propose FortisAVQA, the first robustness-evaluation benchmark for AVQA, featuring rephrased questions and distribution-shift test sets. We further introduce MAVEN, a Multimodal Audio-Visual Epistemic Network built on a cycle collaborative debiasing framework that jointly performs multimodal feature alignment, recurrent co-attention, and distribution-aware adversarial training. This plug-and-play framework explicitly exposes and mitigates model reliance on spurious correlations. On FortisAVQA, MAVEN achieves state-of-the-art performance (+7.81% accuracy) and also generalizes well to MUSIC-AVQA. Ablation and diagnostic analyses reveal critical robustness deficiencies in prevailing AVQA models. All data and code are publicly released.
📝 Abstract
Audio-Visual Question Answering (AVQA) is a challenging multimodal reasoning task that requires intelligent systems to accurately answer natural language queries based on paired audio-video inputs. However, existing AVQA approaches often overfit to dataset biases, which leads to poor robustness. Moreover, current datasets may not effectively diagnose this weakness. To address these challenges, we first introduce a novel dataset, FortisAVQA, constructed in two stages: (1) rephrasing questions in the test split of the public MUSIC-AVQA dataset and (2) introducing distribution shifts across questions. The first stage expands the test space with greater diversity, while the second enables a refined robustness evaluation across rare, frequent, and overall question distributions. Second, we introduce a robust Multimodal Audio-Visual Epistemic Network (MAVEN) that leverages a multifaceted cycle collaborative debiasing strategy to mitigate bias learning. Experimental results demonstrate that our architecture achieves state-of-the-art performance on FortisAVQA, with a notable improvement of 7.81%. Extensive ablation studies on both datasets validate the effectiveness of our debiasing components. Additionally, our evaluation reveals the limited robustness of existing multimodal QA methods. We also verify the plug-and-play capability of our strategy by integrating it with various baseline models across both datasets. Our dataset and code are available at https://github.com/reml-group/fortisavqa.
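To make the rare, frequent, and overall evaluation concrete, the following is a minimal sketch of how such a split could be computed. It is not the released FortisAVQA code. The field names (`question_type`, `answer`, `prediction`), the function names, and the rule that an answer counts as frequent when its count exceeds 1.2 times the mean answer count within its question type are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def split_head_tail(samples, ratio=1.2):
    """Split samples into frequent (head) and rare (tail) subsets.

    Assumption: within each question type, an answer is "frequent" when its
    count exceeds `ratio` times the mean count per distinct answer.
    """
    by_type = defaultdict(list)
    for sample in samples:
        by_type[sample["question_type"]].append(sample)

    head, tail = [], []
    for group in by_type.values():
        counts = Counter(sample["answer"] for sample in group)
        # Mean count per distinct answer, scaled by the (hypothetical) ratio.
        threshold = ratio * len(group) / len(counts)
        for sample in group:
            if counts[sample["answer"]] > threshold:
                head.append(sample)
            else:
                tail.append(sample)
    return head, tail


def accuracy(samples):
    """Fraction of samples whose prediction matches the reference answer."""
    if not samples:
        return 0.0
    correct = sum(s["prediction"] == s["answer"] for s in samples)
    return correct / len(samples)


def robustness_report(samples, ratio=1.2):
    """Report accuracy on the frequent, rare, and overall distributions."""
    head, tail = split_head_tail(samples, ratio)
    return {
        "frequent": accuracy(head),
        "rare": accuracy(tail),
        "overall": accuracy(samples),
    }
```

Under this sketch, a model that always outputs the most common answer for a question type would score well on the frequent subset and poorly on the rare subset, which is the kind of bias reliance a distribution-shift split is meant to expose.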