From Compliance to Exploitation: Jailbreak Prompt Attacks on Multimodal LLMs

📅 2025-02-02

📈 Citations: 0

✨ Influential: 0

career value

212K/year

🤖 AI Summary

This paper identifies a novel security vulnerability in multimodal large language models (MLLMs) under speech-text co-input scenarios. Addressing the limitation of existing safety alignment mechanisms against rich-context speech-based attacks, we introduce “flanking attacks”—a speech-driven jailbreaking paradigm that circumvents safety guards by constructing fictional narrative contexts, jointly injecting benign speech and text prompts, strategically segmenting and scheduling inputs, and camouflaging prohibited instructions within contextual embeddings. Our contributions include: (1) the first systematic methodology for speech-augmented multimodal jailbreaking; (2) a “flanking” prompt architecture and human-like interactive framework; and (3) a semi-automated policy-violation self-evaluation benchmark. Evaluated across seven prohibited domains, our approach achieves average attack success rates of 0.67–0.93, demonstrating a critical security gap in current MLLMs for speech-enhanced multimodal interaction.

Technology Category

Application Category

📝 Abstract

Large Language Models (LLMs) have seen widespread applications across various domains due to their growing ability to process diverse types of input data, including text, audio, image and video. While LLMs have demonstrated outstanding performance in understanding and generating contexts for different scenarios, they are vulnerable to prompt-based attacks, which are mostly via text input. In this paper, we introduce the first voice-based jailbreak attack against multimodal LLMs, termed as Flanking Attack, which can process different types of input simultaneously towards the multimodal LLMs. Our work is motivated by recent advancements in monolingual voice-driven large language models, which have introduced new attack surfaces beyond traditional text-based vulnerabilities for LLMs. To investigate these risks, we examine the frontier multimodal LLMs, which can be accessed via different types of inputs such as audio input, focusing on how adversarial prompts can bypass its defense mechanisms. We propose a novel strategy, in which the disallowed prompt is flanked by benign, narrative-driven prompts. It is integrated in the Flanking Attack which attempts to humanizes the interaction context and execute the attack through a fictional setting. To better evaluate the attack performance, we present a semi-automated self-assessment framework for policy violation detection. We demonstrate that Flank Attack is capable of manipulating state-of-the-art LLMs into generating misaligned and forbidden outputs, which achieves an average attack success rate ranging from 0.67 to 0.93 across seven forbidden scenarios. These findings highlight both the potency of prompt-based obfuscation in voice-enabled contexts and the limitations of current LLMs' moderation safeguards and the urgent need for advanced defense strategies to address the challenges posed by evolving, context-rich attacks.

Problem

Research questions and friction points this paper is trying to address.

Language Models

Adversarial Attacks

Security Vulnerabilities

Innovation

Methods, ideas, or system contributions that make the work stand out.

Flanking Attack

Surrounding Story Method

Semi-automatic Self-check Framework

🔎 Similar Papers

No similar papers found.