Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering

📅 2024-04-18
🏛️ Neural Information Processing Systems
🤖 AI Summary
Existing audio-visual question answering (AVQA) models tend to overlearn dataset biases, exhibit poor robustness, and lack benchmarks suited to diagnosing these weaknesses. Method: The authors introduce MUSIC-AVQA-R, a bias-diagnostic benchmark built by rephrasing the test questions of MUSIC-AVQA and then introducing distribution shifts, enabling robustness evaluation on rare, frequent, and overall questions. They also propose a robust architecture built around a multifaceted cycle collaborative debiasing strategy that can be combined with existing baselines in a plug-and-play fashion. Contribution/Results: The architecture achieves state-of-the-art accuracy on MUSIC-AVQA-R with a 9.32% improvement, generalizing better to both rare and frequent question types, while the benchmark exposes the limited robustness of mainstream multi-modal QA models. Ablations and experiments pairing the strategy with diverse baselines on both datasets confirm its effectiveness and broad applicability.

📝 Abstract
Audio-Visual Question Answering (AVQA) is a complex multi-modal reasoning task, demanding intelligent systems to accurately respond to natural language queries based on audio-video input pairs. Nevertheless, prevalent AVQA approaches are prone to overlearning dataset biases, resulting in poor robustness. Furthermore, current datasets may not provide a precise diagnostic for these methods. To tackle these challenges, firstly, we propose a novel dataset, MUSIC-AVQA-R, crafted in two steps: rephrasing questions within the test split of a public dataset (MUSIC-AVQA) and subsequently introducing distribution shifts to split questions. The former leads to a large, diverse test space, while the latter results in a comprehensive robustness evaluation on rare, frequent, and overall questions. Secondly, we propose a robust architecture that utilizes a multifaceted cycle collaborative debiasing strategy to overcome bias learning. Experimental results show that this architecture achieves state-of-the-art performance on MUSIC-AVQA-R, notably obtaining a significant improvement of 9.32%. Extensive ablation experiments are conducted on the two datasets mentioned to analyze the component effectiveness within the debiasing strategy. Additionally, we highlight the limited robustness of existing multi-modal QA methods through the evaluation on our dataset. We also conduct experiments combining various baselines with our proposed strategy on two datasets to verify its plug-and-play capability. Our dataset and code are available at https://github.com/reml-group/MUSIC-AVQA-R.
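
The distribution-shift step described above implies partitioning the rephrased test questions into rare and frequent subsets. As a rough illustration, the Python sketch below splits QA pairs by answer frequency within each question type; the function name split_head_tail, the ratio threshold, and the assumed sample schema are hypothetical, not the paper's actual construction procedure.

```python
from collections import Counter

def split_head_tail(samples, ratio=1.2):
    """Partition test QA pairs into frequent (head) and rare (tail) subsets.

    Hypothetical sketch: the real MUSIC-AVQA-R split criterion may differ.
    Assumed schema: each sample is a dict with 'question_type' and 'answer'.
    """
    by_type = {}
    for s in samples:
        by_type.setdefault(s["question_type"], []).append(s)

    head, tail = [], []
    for items in by_type.values():
        # Count how often each answer occurs within this question type.
        counts = Counter(s["answer"] for s in items)
        mean_count = sum(counts.values()) / len(counts)
        for s in items:
            # Answers well above the mean frequency go to the head split.
            bucket = head if counts[s["answer"]] > ratio * mean_count else tail
            bucket.append(s)
    return head, tail
```

Reporting accuracy separately on the head, tail, and full test set then yields the rare/frequent/overall breakdown the abstract mentions.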
Problem

Research questions and friction points this paper is trying to address.

Overcoming dataset biases in Audio-Visual Question Answering (AVQA).
Proposing a novel dataset, MUSIC-AVQA-R, for robust evaluation.
Developing a debiasing strategy to improve AVQA model robustness.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed the MUSIC-AVQA-R dataset for robust, bias-diagnostic evaluation.
Introduced a multifaceted cycle collaborative debiasing strategy with plug-and-play capability (a generic debiasing sketch follows this list).
Achieved a state-of-the-art 9.32% accuracy improvement on MUSIC-AVQA-R.
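
To make the plug-and-play idea concrete, here is a minimal training-objective sketch in the spirit of bias-ensemble debiasing. It is a standard product-of-experts formulation, not the paper's multifaceted cycle collaborative debiasing strategy; the name debiased_losses, the bias_weight parameter, and the bias-only branch are assumptions for illustration.

```python
import torch.nn.functional as F

def debiased_losses(main_logits, bias_logits, labels, bias_weight=1.0):
    """Joint objective for a backbone plus a shallow bias-only branch.

    Illustrative stand-in (product-of-experts debiasing), not the paper's
    multifaceted cycle collaborative debiasing strategy.
    """
    # Fuse backbone and bias predictions in log space; the fused loss pushes
    # the backbone to explain whatever the bias branch cannot.
    fused = F.log_softmax(main_logits, dim=-1) + F.log_softmax(bias_logits, dim=-1)
    ensemble_loss = F.cross_entropy(fused, labels)
    # Train the bias branch directly so it absorbs dataset priors,
    # e.g. the most frequent answer given the question text alone.
    bias_loss = F.cross_entropy(bias_logits, labels)
    return ensemble_loss + bias_weight * bias_loss
```

At inference only main_logits would be used, which is why a wrapper of this shape can be attached to different AVQA backbones without changing their architecture.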
Jie Ma
MOE KLINNS Lab, Xi’an Jiaotong University, China
Min Hu
MOE KLINNS Lab, Xi’an Jiaotong University, China; China Mobile System Integration Co.
Pinghui Wang
Xi'an Jiaotong University
Wangchun Sun
MOE KLINNS Lab, Xi’an Jiaotong University, China
Lingyun Song
School of Computer Science, Northwestern Polytechnical University, China
Hongbin Pei
Xi'an Jiaotong University
Jun Liu
MOE KLINNS Lab, Xi’an Jiaotong University, China; School of Computer Science and Technology, Xi’an Jiaotong University, China
Youtian Du
MOE KLINNS Lab, Xi’an Jiaotong University, China