🤖 AI Summary
Existing audio-visual question answering (AVQA) models suffer from dataset bias, exhibit poor robustness, and lack diagnostic evaluation benchmarks. Method: We introduce MUSIC-AVQA-R—a high-challenge bias-diagnostic benchmark featuring explicit distribution shifts and paraphrased test instances—the first AVQA benchmark designed for rigorous bias diagnosis. We further propose CycDebias, a multi-perspective cyclic collaborative debiasing architecture integrating multimodal feature alignment, cyclic attention collaboration, counterfactual data reconstruction, and modular debiasing training, enabling plug-and-play deployment. Contribution/Results: On MUSIC-AVQA-R, CycDebias achieves state-of-the-art accuracy (+9.32%), significantly improving generalization to both rare and frequent question types. It systematically exposes critical robustness deficiencies in mainstream AVQA models. Extensive experiments demonstrate consistent performance gains across diverse backbone architectures, validating its effectiveness and broad applicability.
📝 Abstract
Audio-Visual Question Answering (AVQA) is a complex multi-modal reasoning task, demanding intelligent systems to accurately respond to natural language queries based on audio-video input pairs. Nevertheless, prevalent AVQA approaches are prone to overlearning dataset biases, resulting in poor robustness. Furthermore, current datasets may not provide a precise diagnostic for these methods. To tackle these challenges, firstly, we propose a novel dataset, MUSIC-AVQA-R, crafted in two steps: rephrasing questions within the test split of a public dataset (MUSIC-AVQA) and subsequently introducing distribution shifts to split questions. The former leads to a large, diverse test space, while the latter results in a comprehensive robustness evaluation on rare, frequent, and overall questions. Secondly, we propose a robust architecture that utilizes a multifaceted cycle collaborative debiasing strategy to overcome bias learning. Experimental results show that this architecture achieves state-of-the-art performance on MUSIC-AVQA-R, notably obtaining a significant improvement of 9.32%. Extensive ablation experiments are conducted on the two datasets mentioned to analyze the component effectiveness within the debiasing strategy. Additionally, we highlight the limited robustness of existing multi-modal QA methods through the evaluation on our dataset. We also conduct experiments combining various baselines with our proposed strategy on two datasets to verify its plug-and-play capability. Our dataset and code are available at https://github.com/reml-group/MUSIC-AVQA-R.