🤖 AI Summary
Existing audio-visual models lack systematic robustness evaluation under joint audio-visual distribution shifts at test time; mainstream benchmarks cover only unimodal perturbations and so fail to reflect the co-occurring audio-visual degradations found in realistic settings.
Method: We introduce AVROBUSTBENCH, the first benchmark for evaluating robustness under test-time joint audio-visual distribution shifts. It comprises four multimodal datasets, each with 75 co-occurring, correlated bimodal corruptions. We propose a coordinated audio-visual perturbation generation paradigm and design AV2C, an online cross-modal test-time adaptation (TTA) method that fuses cross-modal features and penalizes high-entropy samples to strengthen adaptation.
Contribution/Results: Experiments reveal a substantial decline in the robustness of mainstream models as joint corruption severity increases. AV2C achieves a +3.2% average accuracy gain on VGGSOUND-2C, significantly outperforming existing test-time adaptation approaches.
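The coordinated perturbation paradigm can be sketched as follows. This is an illustrative assumption, not the benchmark's actual corruption code: the benchmark's 75 corruptions are far more varied, while here both modalities are simply driven by one shared severity level using additive Gaussian noise (the function name `corrupt_pair` and the noise scale are hypothetical).

```python
import numpy as np

def corrupt_pair(audio, video, severity, rng=None):
    """Apply a co-occurring, correlated corruption to both modalities.

    Hypothetical sketch: a single shared `severity` level controls the
    strength of the degradation in both audio and video, so the two
    corruptions co-occur and are correlated by construction.
    """
    rng = np.random.default_rng(rng)
    sigma = 0.02 * severity  # shared severity drives both modalities
    audio_c = audio + rng.normal(0.0, sigma, size=audio.shape)
    # Video frames assumed to be floats in [0, 1]; clip after adding noise.
    video_c = np.clip(video + rng.normal(0.0, sigma, size=video.shape), 0.0, 1.0)
    return audio_c, video_c
```

Keeping one severity parameter for both streams is what makes the perturbations "correlated" rather than two independent unimodal corruptions.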
📝 Abstract
While recent audio-visual models have demonstrated impressive performance, their robustness to distributional shifts at test-time remains not fully understood. Existing robustness benchmarks mainly focus on single modalities, making them insufficient for thoroughly assessing the robustness of audio-visual models. Motivated by real-world scenarios where shifts can occur $\textit{simultaneously}$ in both audio and visual modalities, we introduce $\texttt{AVROBUSTBENCH}$, a comprehensive benchmark designed to evaluate the test-time robustness of audio-visual recognition models. $\texttt{AVROBUSTBENCH}$ comprises four audio-visual benchmark datasets, $\texttt{AUDIOSET-2C}$, $\texttt{VGGSOUND-2C}$, $\texttt{KINETICS-2C}$, and $\texttt{EPICKITCHENS-2C}$, each incorporating 75 bimodal audio-visual corruptions that are $\textit{co-occurring}$ and $\textit{correlated}$. Through extensive evaluations, we observe that state-of-the-art supervised and self-supervised audio-visual models exhibit declining robustness as corruption severity increases. Furthermore, online test-time adaptation (TTA) methods, on $\texttt{VGGSOUND-2C}$ and $\texttt{KINETICS-2C}$, offer minimal improvements in performance under bimodal corruptions. We further propose $\texttt{AV2C}$, a simple TTA approach enabling on-the-fly cross-modal fusion by penalizing high-entropy samples, which achieves improvements on $\texttt{VGGSOUND-2C}$. We hope that $\texttt{AVROBUSTBENCH}$ will steer the development of more effective and robust audio-visual TTA approaches. Our code is available $\href{https://github.com/sarthaxxxxx/AV-C-Robustness-Benchmark}{here}$.
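The AV2C idea described above (on-the-fly cross-modal fusion with a penalty on high-entropy samples) can be sketched roughly as below. This is a hedged reconstruction, not the paper's implementation: the fusion rule (averaging per-modality softmax probabilities), the exponential down-weighting scheme, and the entropy threshold `e0` are all assumptions introduced for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def av2c_style_loss(audio_logits, video_logits, e0=2.0):
    """Entropy objective over fused audio-visual predictions.

    Hypothetical sketch: fuse the two modalities by averaging their softmax
    probabilities, compute per-sample entropy, and down-weight high-entropy
    samples so that confident samples dominate the adaptation signal.
    `e0` is an assumed entropy threshold, not a value from the paper.
    """
    p = 0.5 * (softmax(audio_logits) + softmax(video_logits))  # cross-modal fusion
    entropy = -(p * np.log(p + 1e-12)).sum(axis=-1)            # per-sample entropy
    weights = np.exp(-np.clip(entropy - e0, 0.0, None))        # penalize high entropy
    return (weights * entropy).mean()
```

In an online TTA loop, this scalar would be minimized with respect to a small set of model parameters (e.g. normalization statistics) on each incoming test batch; confident (low-entropy) batches contribute the strongest gradient signal.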