FortisAVQA and MAVEN: a Benchmark Dataset and Debiasing Framework for Robust Multimodal Reasoning

📅 2025-04-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing audio-visual question answering (AVQA) methods tend to overfit to dataset biases and therefore generalize poorly, and current datasets may not effectively diagnose this weakness. To address this, we propose FortisAVQA, a robustness-evaluation benchmark built by rephrasing the test-split questions of MUSIC-AVQA and introducing distribution shifts across questions. We further introduce MAVEN (Multimodal Audio-Visual Epistemic Network), which uses a multifaceted cycle collaborative debiasing strategy to mitigate bias learning. The strategy is plug-and-play and can be integrated with various baseline models. On FortisAVQA, MAVEN achieves state-of-the-art performance, with an improvement of 7.81%. Ablation studies on both FortisAVQA and MUSIC-AVQA validate the debiasing components, and the evaluation reveals the limited robustness of existing multimodal QA methods. The dataset and code are publicly released.

📝 Abstract

Audio-Visual Question Answering (AVQA) is a challenging multimodal reasoning task requiring intelligent systems to accurately answer natural language queries based on paired audio-video inputs. However, existing AVQA approaches often suffer from overfitting to dataset biases, leading to poor robustness. Moreover, current datasets may not effectively diagnose these methods. To address these challenges, we first introduce a novel dataset, FortisAVQA, constructed in two stages: (1) rephrasing questions in the test split of the public MUSIC-AVQA dataset and (2) introducing distribution shifts across questions. The first stage expands the test space with greater diversity, while the second enables a refined robustness evaluation across rare, frequent, and overall question distributions. Second, we introduce a robust Multimodal Audio-Visual Epistemic Network (MAVEN) that leverages a multifaceted cycle collaborative debiasing strategy to mitigate bias learning. Experimental results demonstrate that our architecture achieves state-of-the-art performance on FortisAVQA, with a notable improvement of 7.81%. Extensive ablation studies on both datasets validate the effectiveness of our debiasing components. Additionally, our evaluation reveals the limited robustness of existing multimodal QA methods. We also verify the plug-and-play capability of our strategy by integrating it with various baseline models across both datasets. Our dataset and code are available at https://github.com/reml-group/fortisavqa.
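
The rare, frequent, and overall evaluation mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' procedure: the page does not state how FortisAVQA assigns questions to the rare and frequent subsets, so the rule used here (an answer is "frequent" when its count within a question group exceeds `ratio` times the group's mean answer count) is an assumption, and the names `split_head_tail`, `accuracy_by_split`, `group`, and `ratio` are hypothetical.

```python
from collections import Counter, defaultdict


def split_head_tail(samples, ratio=1.2):
    """Label each sample 'frequent' or 'rare' by answer frequency in its group.

    Assumed rule (not taken from the paper): an answer is 'frequent' when its
    count exceeds `ratio` times the mean answer count of its question group.
    """
    counts = defaultdict(Counter)
    for sample in samples:
        counts[sample["group"]][sample["answer"]] += 1

    labels = []
    for sample in samples:
        group_counts = counts[sample["group"]]
        mean_count = sum(group_counts.values()) / len(group_counts)
        is_frequent = group_counts[sample["answer"]] > ratio * mean_count
        labels.append("frequent" if is_frequent else "rare")
    return labels


def accuracy_by_split(samples, predictions, labels):
    """Return accuracy on the frequent, rare, and overall subsets."""
    hits, totals = Counter(), Counter()
    for sample, prediction, label in zip(samples, predictions, labels):
        for key in (label, "overall"):
            totals[key] += 1
            hits[key] += int(prediction == sample["answer"])
    return {key: hits[key] / totals[key] for key in totals}


# Toy data: a model that always predicts the majority answer.
toy = [{"group": "count", "answer": "two"}] * 4 + [
    {"group": "count", "answer": "three"}
]
toy_labels = split_head_tail(toy)
print(accuracy_by_split(toy, ["two"] * 5, toy_labels))
# frequent: 1.0, rare: 0.0, overall: 0.8
```

On the toy data, a predictor that exploits the answer prior looks strong overall (0.8) but scores 0.0 on the rare subset, which is the kind of gap a distribution-shifted test set is meant to expose.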

Problem

Research questions and friction points this paper is trying to address.

Addressing overfitting to biases in Audio-Visual Question Answering (AVQA) tasks

Introducing a novel dataset (FortisAVQA) for robust multimodal reasoning evaluation

Proposing a debiasing framework (MAVEN) to improve AVQA model robustness

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces FortisAVQA dataset with diverse test questions

Proposes MAVEN framework for multimodal debiasing

Achieves state-of-the-art performance on FortisAVQA, with a 7.81% improvement

🔎 Similar Papers

Reefknot: A Comprehensive Benchmark for Relation Hallucination Evaluation, Analysis and Mitigation in Multimodal Large Language Models

2024-08-18, arXiv.org, Citations: 6