Speech-Audio Compositional Attacks on Multimodal LLMs and Their Mitigation with SALMONN-Guard

📅 2025-11-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Contemporary multimodal large language models (MLLMs) face novel security threats in speech-audio joint understanding, where conventional text-based filtering mechanisms fail to mitigate cross-modal compositional attacks. To address this gap, we introduce SACRED-Bench—the first adversarial benchmark for mixed speech and non-speech audio inputs—featuring attack paradigms that encompass multi-speaker overlapping speech, implicit-intent non-speech acoustic signals (e.g., environmental cues), and diverse spoken instructions (e.g., open-ended and yes/no questions). Leveraging an end-to-end security evaluation framework, we achieve a 66% attack success rate on Gemini 2.5 Pro, exposing critical vulnerabilities in current MLLMs. Furthermore, we propose SALMONN-Guard, a dedicated defense method that reduces the attack success rate to 20%, significantly enhancing robustness and safety in multimodal audio understanding scenarios.

📝 Abstract
Recent progress in large language models (LLMs) has enabled understanding of both speech and non-speech audio, but it also exposes new safety risks arising from complex audio inputs that current safeguards handle inadequately. We introduce SACRED-Bench (Speech-Audio Composition for RED-teaming) to evaluate the robustness of LLMs under complex audio-based attacks. Unlike existing perturbation-based methods that rely on noise optimization or white-box access, SACRED-Bench exploits speech-audio composition mechanisms. SACRED-Bench adopts three mechanisms: (a) speech overlap and multi-speaker dialogue, which embed harmful prompts beneath or alongside benign speech; (b) speech-audio mixture, which implies unsafe intent via non-speech audio alongside benign speech or audio; and (c) diverse spoken instruction formats (open-ended QA, yes/no) that evade text-only filters. Experiments show that even Gemini 2.5 Pro, a state-of-the-art proprietary LLM, still exhibits a 66% attack success rate on the SACRED-Bench test set, exposing vulnerabilities to cross-modal, speech-audio composition attacks. To bridge this gap, we propose SALMONN-Guard, a safeguard LLM that jointly inspects speech, audio, and text for safety judgments, reducing the attack success rate to 20%. Our results highlight the need for audio-aware defenses for the safety of multimodal LLMs. The benchmark and SALMONN-Guard checkpoints can be found at https://huggingface.co/datasets/tsinghua-ee/SACRED-Bench. Warning: this paper includes examples that may be offensive or harmful.
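Mechanism (a), speech overlap, can be pictured as a simple waveform-level mix: a harmful speech track is overlaid beneath a benign one at reduced gain, so the harmful content remains intelligible to a speech model while being masked by the dominant benign signal. The sketch below is illustrative only; the function name, gain value, and normalization are assumptions for this example, not the paper's actual composition pipeline.

```python
import numpy as np

def mix_overlap(benign: np.ndarray, harmful: np.ndarray,
                harmful_gain: float = 0.3) -> np.ndarray:
    """Overlay a harmful speech waveform beneath a benign one.

    Both inputs are mono float waveforms at the same sample rate.
    The harmful track is attenuated (harmful_gain) and summed with
    the benign track; the result is peak-normalized to avoid clipping.
    """
    n = max(len(benign), len(harmful))
    out = np.zeros(n, dtype=np.float32)
    out[: len(benign)] += benign.astype(np.float32)
    out[: len(harmful)] += harmful_gain * harmful.astype(np.float32)
    peak = float(np.max(np.abs(out)))
    # Normalize only if the mix exceeds full scale.
    return out / peak if peak > 1.0 else out
```

A multi-speaker dialogue variant of mechanism (a) would concatenate turns rather than sum them, and mechanism (b) would substitute a non-speech clip (e.g., an environmental sound) for the attenuated track.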
Problem

Research questions and friction points this paper is trying to address.

Multimodal LLMs face safety risks from complex speech-audio compositional attacks
Existing safeguards inadequately handle harmful prompts embedded in audio mixtures
Cross-modal attacks exploit speech overlap and non-speech audio to bypass filters
Innovation

Methods, ideas, or system contributions that make the work stand out.

SACRED-Bench evaluates attacks built from speech-audio composition
SALMONN-Guard jointly inspects speech, audio, and text for safety judgments
Mitigation reduces attack success rate from 66% to 20%