🤖 AI Summary
This study identifies two novel adversarial vulnerabilities in audio-visual models: temporal invariance violation and cross-modal misalignment. To exploit this coupled temporal–modal structure, we propose two potent attacks: the Temporal Invariance Attack (TIA) and the Modality Misalignment Attack (MMA). We further introduce the first adaptive adversarial training framework tailored to joint audio-visual modeling, combining efficient adversarial perturbation generation with adversarial curriculum learning. On the Kinetics-Sounds benchmark, our attacks achieve state-of-the-art performance; under our defense, model robustness improves significantly while training efficiency increases by 37%. This work provides the first systematic characterization of the joint temporal–modal vulnerability of audio-visual models, establishing a new paradigm for robust multimodal learning and delivering practical tools for real-world deployment. A hedged sketch of how the two attacks could be instantiated follows.
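The paper's implementation is not shown here, so the following is only a minimal PGD-style sketch of one plausible instantiation of TIA and MMA, not the authors' method. It assumes a hypothetical model interface `model(video, audio)` that returns logits plus per-segment video and audio features of shape `[B, T, D]`; that interface, the loss weighting `lam`, and all hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def tia_mma_attack(model, video, audio, labels,
                   eps=8/255, alpha=2/255, steps=10, lam=1.0):
    """Sketch of a joint attack: PGD ascent on classification loss plus
    (a) a temporal term that pushes consecutive-segment features apart
    (violating temporal invariance) and (b) a cross-modal term that
    decorrelates audio and video features (inducing misalignment).
    The three-output model interface is an assumption, not the paper's API."""
    v_adv = video.clone().detach() + torch.empty_like(video).uniform_(-eps, eps)
    a_adv = audio.clone().detach() + torch.empty_like(audio).uniform_(-eps, eps)
    for _ in range(steps):
        v_adv.requires_grad_(True)
        a_adv.requires_grad_(True)
        logits, v_feat, a_feat = model(v_adv, a_adv)  # assumed: [B,C], [B,T,D], [B,T,D]
        cls_loss = F.cross_entropy(logits, labels)
        # TIA surrogate: maximizing this drives consecutive segments apart
        tia = -F.cosine_similarity(v_feat[:, :-1], v_feat[:, 1:], dim=-1).mean()
        # MMA surrogate: maximizing this decorrelates pooled audio/video features
        mma = -F.cosine_similarity(v_feat.mean(1), a_feat.mean(1), dim=-1).mean()
        loss = cls_loss + lam * (tia + mma)
        gv, ga = torch.autograd.grad(loss, [v_adv, a_adv])
        # signed-gradient ascent step, then project back into the eps-ball
        v_adv = (v_adv + alpha * gv.sign()).detach()
        a_adv = (a_adv + alpha * ga.sign()).detach()
        v_adv = video + (v_adv - video).clamp(-eps, eps)
        a_adv = audio + (a_adv - audio).clamp(-eps, eps)
    return v_adv, a_adv
```

Using cosine similarity as the invariance/alignment surrogate is one design choice among several; any feature-space distance would fit the same ascent structure.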
📝 Abstract
While audio-visual learning equips models with a richer understanding of the real world by leveraging multiple sensory modalities, this integration also introduces new vulnerabilities to adversarial attacks. In this paper, we present a comprehensive study of the adversarial robustness of audio-visual models, considering both temporal and modality-specific vulnerabilities. We propose two powerful adversarial attacks: 1) a temporal invariance attack that exploits the inherent temporal redundancy across consecutive time segments, and 2) a modality misalignment attack that introduces incongruence between the audio and visual modalities. These attacks are designed to thoroughly assess the robustness of audio-visual models against diverse threats. Furthermore, to defend against such attacks, we introduce a novel audio-visual adversarial training framework. This framework addresses key challenges of vanilla adversarial training by incorporating efficient adversarial perturbation crafting tailored to multi-modal data and an adversarial curriculum strategy, sketched below. Extensive experiments on the Kinetics-Sounds dataset demonstrate that our proposed temporal and modality-based attacks achieve state-of-the-art effectiveness in degrading model performance, while our adversarial training defense substantially improves both adversarial robustness and adversarial training efficiency.
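On the defense side, here is a similarly hedged sketch of one plausible reading of the adversarial curriculum: perturbation strength and attack steps ramp up with training progress, so early epochs face weak, cheap adversaries and later epochs face the full-strength attack. It reuses `tia_mma_attack` and the imports from the sketch above; the linear schedule, the mid-training saturation point, and `curriculum_schedule` itself are illustrative assumptions, not the paper's exact recipe.

```python
def curriculum_schedule(epoch, total_epochs, eps_max=8/255, steps_max=10):
    """Linearly ramp attack strength, reaching full strength at mid-training.
    The schedule shape is an assumption for illustration."""
    frac = min(1.0, (epoch + 1) / (0.5 * total_epochs))
    return eps_max * frac, max(1, int(steps_max * frac))

def train_epoch(model, loader, optimizer, epoch, total_epochs):
    """One epoch of adversarial training on attack-crafted examples."""
    model.train()
    for video, audio, labels in loader:
        eps, steps = curriculum_schedule(epoch, total_epochs)
        # fewer attack steps early on is one way the curriculum can
        # also reduce training cost, not just ease the optimization
        v_adv, a_adv = tia_mma_attack(model, video, audio, labels,
                                      eps=eps, alpha=eps / 4, steps=steps)
        logits, _, _ = model(v_adv, a_adv)
        loss = F.cross_entropy(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```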