🤖 AI Summary
This study investigates the feasibility of launching untargeted adversarial attacks on audio-visual-language trimodal models using only audio inputs, exposing a critical security vulnerability in multimodal systems. We present the first systematic evaluation of how perturbations in a single audio modality propagate through and disrupt the stages of trimodal architectures (audio encoding, cross-modal attention mechanisms, and final outputs), and introduce six complementary attack objectives. Using gradient-based adversarial generation, we validate the approach on three state-of-the-art trimodal models and on the Whisper speech recognizer. Attacks on the trimodal models reach success rates of up to 96% under low perceptual distortion (LPIPS ≤ 0.08, SI-SNR ≥ 0), with gains coming more from extended optimization than from increased data scale; Whisper, by contrast, responds primarily to perturbation magnitude, exceeding 97% attack success only under severe distortion.
📝 Abstract
Multimodal foundation models that integrate audio, vision, and language achieve strong performance on reasoning and generation tasks, yet their robustness to adversarial manipulation remains poorly understood. We study a realistic and underexplored threat model: untargeted, audio-only adversarial attacks on trimodal audio-video-language models. We analyze six complementary attack objectives that target different stages of multimodal processing, including audio encoder representations, cross-modal attention, hidden states, and output likelihoods. Across three state-of-the-art models and multiple benchmarks, we show that audio-only perturbations can induce severe multimodal failures, achieving up to 96% attack success rate. We further show that attacks can be successful at low perceptual distortions (LPIPS ≤ 0.08, SI-SNR ≥ 0) and benefit more from extended optimization than increased data scale. Transferability across models and encoders remains limited, while speech recognition systems such as Whisper primarily respond to perturbation magnitude, achieving >97% attack success under severe distortion.
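As a rough illustration of the attack setup (a minimal sketch, not the authors' implementation), the following shows an untargeted PGD-style perturbation that pushes an audio encoder's embedding away from its clean value under an L∞ budget. The linear "encoder" and all hyperparameters here are hypothetical stand-ins; the paper attacks real trimodal models with several different objectives, and SI-SNR is shown only to illustrate the perceptual-distortion constraint.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for an audio encoder: a fixed random linear map.
# (The paper targets real encoders inside trimodal models.)
W = rng.standard_normal((16, 256)) * 0.1

def encode(x):
    return W @ x

def si_snr(clean, noisy):
    """Scale-invariant SNR in dB of `noisy` against the clean reference."""
    alpha = np.dot(noisy, clean) / np.dot(clean, clean)
    target = alpha * clean
    noise = noisy - target
    return 10 * np.log10(np.dot(target, target) / np.dot(noise, noise))

def pgd_untargeted(x, steps=50, eps=0.01, step=0.002):
    """Untargeted PGD: maximize embedding drift ||f(x+d) - f(x)||^2."""
    z_clean = encode(x)
    # Random start inside the L-infinity ball (the gradient is zero at d = 0
    # for this objective, so a zero init would never move).
    delta = rng.uniform(-eps, eps, size=x.shape)
    for _ in range(steps):
        # Analytic gradient for the linear toy encoder:
        # d/dd ||W(x+d) - Wx||^2 = 2 W^T (W(x+d) - Wx)
        grad = 2 * W.T @ (encode(x + delta) - z_clean)
        delta = delta + step * np.sign(grad)   # signed-gradient ascent
        delta = np.clip(delta, -eps, eps)      # project back into the budget
    return x + delta

x = rng.standard_normal(256) * 0.5
x_adv = pgd_untargeted(x)
drift = np.linalg.norm(encode(x_adv) - encode(x))
print(f"embedding drift: {drift:.3f}, SI-SNR: {si_snr(x, x_adv):.1f} dB")
```

With a small eps, the perturbation stays at high SI-SNR (well above 0 dB) while still displacing the embedding, which is the regime the low-distortion results describe.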