SoundBreak: A Systematic Study of Audio-Only Adversarial Attacks on Trimodal Models

📅 2026-01-20
🤖 AI Summary
This study presents the first systematic evaluation of untargeted, audio-only adversarial attacks on audio-visual-language trimodal models, exposing a previously overlooked single-modality attack surface in multimodal systems. It analyzes how perturbations to the audio modality alone propagate through successive stages of trimodal processing (audio encoder representations, cross-modal attention, hidden states, and output likelihoods) and introduces six complementary attack objectives targeting those stages. Using gradient-based adversarial generation, the attacks are evaluated on three state-of-the-art trimodal models and on the Whisper speech recognizer. Audio-only perturbations achieve attack success rates of up to 96% on the trimodal models, remain effective under low perceptual distortion (LPIPS ≤ 0.08, SI-SNR ≥ 0), and benefit more from extended optimization than from increased data scale. Whisper, by contrast, responds primarily to perturbation magnitude, exceeding 97% attack success only under severe distortion.
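The gradient-based generation described above can be sketched as a standard projected gradient descent (PGD) loop over the waveform. This is a minimal illustration under assumptions, not the paper's implementation: the toy linear "encoder feature" objective, the hyperparameters, and the `grad_fn` interface are all hypothetical stand-ins for a real model's differentiable attack objective.

```python
import numpy as np

def pgd_untargeted(x, grad_fn, eps=0.01, alpha=0.002, steps=40, rng=None):
    """Untargeted PGD on a waveform x inside an L-inf ball of radius eps.

    grad_fn(x_adv) returns the gradient of the attack objective (e.g. the
    squared distance between clean and perturbed encoder features) with
    respect to the input; we ascend that objective with sign steps.
    """
    rng = rng or np.random.default_rng(0)
    # Random start inside the ball, so the first gradient is non-zero.
    x_adv = np.clip(x + rng.uniform(-eps, eps, size=x.shape), -1.0, 1.0)
    for _ in range(steps):
        g = grad_fn(x_adv)
        x_adv = x_adv + alpha * np.sign(g)           # gradient-ascent step
        x_adv = x + np.clip(x_adv - x, -eps, eps)    # project into L-inf ball
        x_adv = np.clip(x_adv, -1.0, 1.0)            # keep a valid waveform
    return x_adv

# Toy stand-in for an encoder-feature objective: push the scalar feature
# w @ x away from its clean value (purely illustrative, not the paper's loss).
rng = np.random.default_rng(1)
w = rng.normal(size=16000)
x_clean = 0.05 * rng.normal(size=16000)
f_clean = w @ x_clean
grad_fn = lambda x_adv: 2.0 * (w @ x_adv - f_clean) * w  # d/dx (f - f_clean)^2
x_adv = pgd_untargeted(x_clean, grad_fn)
```

In a real attack the gradient would come from backpropagation through the model's audio encoder or cross-modal attention, and the perceptual constraints (LPIPS, SI-SNR) would be enforced alongside the L-inf projection.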

📝 Abstract
Multimodal foundation models that integrate audio, vision, and language achieve strong performance on reasoning and generation tasks, yet their robustness to adversarial manipulation remains poorly understood. We study a realistic and underexplored threat model: untargeted, audio-only adversarial attacks on trimodal audio-video-language models. We analyze six complementary attack objectives that target different stages of multimodal processing, including audio encoder representations, cross-modal attention, hidden states, and output likelihoods. Across three state-of-the-art models and multiple benchmarks, we show that audio-only perturbations can induce severe multimodal failures, achieving up to 96% attack success rate. We further show that attacks can be successful at low perceptual distortions (LPIPS ≤ 0.08, SI-SNR ≥ 0) and benefit more from extended optimization than increased data scale. Transferability across models and encoders remains limited, while speech recognition systems such as Whisper primarily respond to perturbation magnitude, achieving >97% attack success under severe distortion. These results expose a previously overlooked single-modality attack surface in multimodal systems and motivate defenses that enforce cross-modal consistency.
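For reference, the SI-SNR constraint quoted in the abstract (SI-SNR ≥ 0 dB between clean and perturbed audio) can be computed with the standard scale-invariant SNR definition sketched below; this is a generic implementation of that metric, not code released with the paper.

```python
import numpy as np

def si_snr(ref, est, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB (higher = closer to ref)."""
    ref = ref - ref.mean()   # zero-mean both signals, per the usual convention
    est = est - est.mean()
    # Project the estimate onto the reference; the residual is "noise".
    s_target = (est @ ref) / (ref @ ref + eps) * ref
    e_noise = est - s_target
    return 10.0 * np.log10((s_target @ s_target + eps) / (e_noise @ e_noise + eps))

rng = np.random.default_rng(0)
clean = rng.normal(size=16000)
perturbed = clean + 0.1 * rng.normal(size=16000)  # mild additive perturbation
```

A perturbation satisfying SI-SNR ≥ 0 leaves at least as much projected signal energy as residual noise energy, so the clean audio still dominates the perturbed waveform.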
Problem

Research questions and friction points this paper is trying to address.

adversarial attacks
audio-only perturbations
trimodal models
multimodal robustness
cross-modal failure
Innovation

Methods, ideas, or system contributions that make the work stand out.

audio-only adversarial attacks
trimodal models
cross-modal attention
perceptual distortion
multimodal robustness
Aafiya Hussain
Department of Computer Science, Virginia Tech, USA

Gaurav Srivastava
Graduate Student, Virginia Tech | Dell Technologies
Natural Language Processing · Large Language Models · Complex Reasoning · Small Language Models

A. Ishmam
Department of Computer Science, Virginia Tech, USA

Z. Hakim
Department of Computer Science, Virginia Tech, USA

Chris Thomas
Virginia Tech
Computer Vision