FortisAVQA and MAVEN: a Benchmark Dataset and Debiasing Framework for Robust Multimodal Reasoning

📅 2025-04-01
🤖 AI Summary
Existing audio-visual question answering (AVQA) methods generalize poorly because they overfit to dataset biases, and the field lacks standardized benchmarks for diagnosing this weakness. To address this, we propose FortisAVQA, the first robustness-evaluation benchmark for AVQA, featuring rephrased questions and distribution-shift test sets. We further introduce MAVEN, a Multimodal Audio-Visual Epistemic Network built around a cycle collaborative debiasing framework that jointly performs multimodal feature alignment, recurrent co-attention, and distribution-aware adversarial training. This plug-and-play framework explicitly exposes and mitigates model reliance on spurious correlations. On FortisAVQA, MAVEN achieves state-of-the-art performance with a 7.81% improvement, while also demonstrating strong cross-dataset generalization on MUSIC-AVQA. Comprehensive ablation and diagnostic analyses systematically uncover critical robustness deficiencies in prevailing AVQA models. All data and code are publicly released.

📝 Abstract
Audio-Visual Question Answering (AVQA) is a challenging multimodal reasoning task requiring intelligent systems to accurately answer natural language queries based on paired audio-video inputs. However, existing AVQA approaches often suffer from overfitting to dataset biases, leading to poor robustness. Moreover, current datasets may not effectively diagnose these methods. To address these challenges, we first introduce a novel dataset, FortisAVQA, constructed in two stages: (1) rephrasing questions in the test split of the public MUSIC-AVQA dataset and (2) introducing distribution shifts across questions. The first stage expands the test space with greater diversity, while the second enables a refined robustness evaluation across rare, frequent, and overall question distributions. Second, we introduce a robust Multimodal Audio-Visual Epistemic Network (MAVEN) that leverages a multifaceted cycle collaborative debiasing strategy to mitigate bias learning. Experimental results demonstrate that our architecture achieves state-of-the-art performance on FortisAVQA, with a notable improvement of 7.81%. Extensive ablation studies on both datasets validate the effectiveness of our debiasing components. Additionally, our evaluation reveals the limited robustness of existing multimodal QA methods. We also verify the plug-and-play capability of our strategy by integrating it with various baseline models across both datasets. Our dataset and code are available at https://github.com/reml-group/fortisavqa.
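The evaluation across rare, frequent, and overall question distributions can be illustrated with a small sketch. The abstract does not say how "rare" and "frequent" are defined, so everything below is an assumption made for illustration: the sample fields `qtype` and `answer`, the function names, and the rule that an answer counts as frequent ("head") when its count within its question type exceeds `ratio` times the mean answer count for that type. It is not the FortisAVQA construction procedure.

```python
from collections import Counter, defaultdict


def split_head_tail(samples, ratio=1.2):
    """Label each sample 'head' (frequent answer) or 'tail' (rare answer).

    Assumption (not from the paper): an answer is 'head' when its count
    within its question type exceeds `ratio` times the mean answer count
    for that question type.
    """
    counts = defaultdict(Counter)
    for s in samples:
        counts[s["qtype"]][s["answer"]] += 1

    labels = []
    for s in samples:
        per_type = counts[s["qtype"]]
        mean_count = sum(per_type.values()) / len(per_type)
        is_head = per_type[s["answer"]] > ratio * mean_count
        labels.append("head" if is_head else "tail")
    return labels


def accuracy_by_group(samples, predictions, labels):
    """Return accuracy for each group label plus an 'overall' entry."""
    hits = defaultdict(list)
    for s, pred, group in zip(samples, predictions, labels):
        correct = float(pred == s["answer"])
        hits[group].append(correct)
        hits["overall"].append(correct)
    return {group: sum(v) / len(v) for group, v in hits.items()}
```

Under these assumptions, a model that always predicts the most common answer for a question type would score well on the head subset and poorly on the tail subset, which is the kind of bias reliance a split evaluation is meant to expose and that a single overall accuracy figure can hide.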
Problem

Research questions and friction points this paper is trying to address.

Addressing overfitting to biases in Audio-Visual Question Answering (AVQA) tasks
Introducing a novel dataset (FortisAVQA) for robust multimodal reasoning evaluation
Proposing a debiasing framework (MAVEN) to improve AVQA model robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces FortisAVQA dataset with diverse test questions
Proposes MAVEN framework for multimodal debiasing
Achieves state-of-the-art performance on FortisAVQA, with a 7.81% improvement
Jie Ma
Ministry of Education of Key Laboratory for Intelligent Networks and Network Security, School of Cyber Science and Engineering, Xi’an Jiaotong University, Xi’an, Shaanxi 710049, China
Zhitao Gao
Shaanxi Provincial Key Laboratory of Big Data Knowledge Engineering, School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, Shaanxi 710049, China
Qi Chai
Information Hub, Hong Kong University of Science and Technology (Guangzhou), Guangzhou, Guangdong, 510000, China
Jun Liu
Shaanxi Provincial Key Laboratory of Big Data Knowledge Engineering, School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, Shaanxi 710049, China
Pinghui Wang
Xi'an Jiaotong University
Jing Tao
Ministry of Education of Key Laboratory for Intelligent Networks and Network Security, School of Cyber Science and Engineering, Xi’an Jiaotong University, Xi’an, Shaanxi 710049, China
Zhou Su
Xi'an Jiaotong University