🤖 AI Summary
Existing audio-visual models lack systematic robustness evaluation under joint audio-visual distribution shifts at test time; mainstream benchmarks cover only unimodal perturbations and so fail to reflect the co-occurring audio-visual degradations found in realistic settings.
Method: We introduce AVROBUSTBENCH, the first benchmark for evaluating robustness under test-time joint audio-visual distribution shifts. It comprises four multimodal datasets, each with 75 co-occurring, correlated bimodal corruptions. We propose a coordinated audio-visual perturbation generation paradigm and design AV2C, an online cross-modal test-time adaptation (TTA) method that fuses cross-modal features and penalizes high-entropy samples to strengthen adaptation.
Contribution/Results: Experiments reveal a substantial decline in the robustness of mainstream models as joint corruption severity increases. AV2C achieves a +3.2% average accuracy gain on VGGSOUND-2C, significantly outperforming existing test-time adaptation approaches.
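The coordinated perturbation paradigm can be sketched as follows. This is an illustrative assumption, not the benchmark's actual corruption code: the benchmark's 75 corruptions are far more varied, while here both modalities are simply driven by one shared severity level using additive Gaussian noise (the function name `corrupt_pair` and the noise scale are hypothetical).

```python
import numpy as np

def corrupt_pair(audio, video, severity, rng=None):
    """Apply a co-occurring, correlated corruption to both modalities.

    Hypothetical sketch: a single shared `severity` level controls the
    strength of the degradation in both audio and video, so the two
    corruptions co-occur and are correlated by construction.
    """
    rng = np.random.default_rng(rng)
    sigma = 0.02 * severity  # shared severity drives both modalities
    audio_c = audio + rng.normal(0.0, sigma, size=audio.shape)
    # Video frames assumed to be floats in [0, 1]; clip after adding noise.
    video_c = np.clip(video + rng.normal(0.0, sigma, size=video.shape), 0.0, 1.0)
    return audio_c, video_c
```

Keeping one severity parameter for both streams is what makes the perturbations "correlated" rather than two independent unimodal corruptions.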
📝 Abstract
While recent audio-visual models have demonstrated impressive performance, their robustness to distributional shifts at test-time remains not fully understood. Existing robustness benchmarks mainly focus on single modalities, making them insufficient for thoroughly assessing the robustness of audio-visual models. Motivated by real-world scenarios where shifts can occur $\textit{simultaneously}$ in both audio and visual modalities, we introduce $\texttt{AVROBUSTBENCH}$, a comprehensive benchmark designed to evaluate the test-time robustness of audio-visual recognition models. $\texttt{AVROBUSTBENCH}$ comprises four audio-visual benchmark datasets, $\texttt{AUDIOSET-2C}$, $\texttt{VGGSOUND-2C}$, $\texttt{KINETICS-2C}$, and $\texttt{EPICKITCHENS-2C}$, each incorporating 75 bimodal audio-visual corruptions that are $\textit{co-occurring}$ and $\textit{correlated}$. Through extensive evaluations, we observe that state-of-the-art supervised and self-supervised audio-visual models exhibit declining robustness as corruption severity increases. Furthermore, online test-time adaptation (TTA) methods, on $\texttt{VGGSOUND-2C}$ and $\texttt{KINETICS-2C}$, offer minimal improvements in performance under bimodal corruptions. We further propose $\texttt{AV2C}$, a simple TTA approach enabling on-the-fly cross-modal fusion by penalizing high-entropy samples, which achieves improvements on $\texttt{VGGSOUND-2C}$. We hope that $\texttt{AVROBUSTBENCH}$ will steer the development of more effective and robust audio-visual TTA approaches. Our code is available $\href{https://github.com/sarthaxxxxx/AV-C-Robustness-Benchmark}{here}$.
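The AV2C idea described above (on-the-fly cross-modal fusion with a penalty on high-entropy samples) can be sketched roughly as below. This is a hedged reconstruction, not the paper's implementation: the fusion rule (averaging per-modality softmax probabilities), the exponential down-weighting scheme, and the entropy threshold `e0` are all assumptions introduced for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def av2c_style_loss(audio_logits, video_logits, e0=2.0):
    """Entropy objective over fused audio-visual predictions.

    Hypothetical sketch: fuse the two modalities by averaging their softmax
    probabilities, compute per-sample entropy, and down-weight high-entropy
    samples so that confident samples dominate the adaptation signal.
    `e0` is an assumed entropy threshold, not a value from the paper.
    """
    p = 0.5 * (softmax(audio_logits) + softmax(video_logits))  # cross-modal fusion
    entropy = -(p * np.log(p + 1e-12)).sum(axis=-1)            # per-sample entropy
    weights = np.exp(-np.clip(entropy - e0, 0.0, None))        # penalize high entropy
    return (weights * entropy).mean()
```

In an online TTA loop, this scalar would be minimized with respect to a small set of model parameters (e.g. normalization statistics) on each incoming test batch; confident (low-entropy) batches contribute the strongest gradient signal.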