🤖 AI Summary
Despite safety alignment, large vision-language models (LVLMs) remain vulnerable to multimodal jailbreaking attacks via their visual modality, enabling generation of harmful content.
Method: We propose FC-Attack, the first structured jailbreak attack leveraging automatically generated flowcharts. Our approach introduces a novel visual prompting paradigm: “text → step-wise description → multi-topology flowchart (vertical/horizontal/S-shaped),” revealing that embedding harmful intent directly into the flowchart structure suffices for effective jailbreaking. We further incorporate font-style adversarial analysis and multimodal prompt fusion to enhance attack efficacy.
Contribution/Results: The method achieves >90% success rates across four major LVLMs—including Gemini-1.5 Pro—demonstrating strong generalizability. Even font-style manipulation alone boosts Claude-3.5 Sonnet’s jailbreak rate by 24 percentage points. We also evaluate the defense AdaShield, confirming its mitigation capability but at substantial cost to model utility. This work exposes critical structural vulnerabilities in LVLM visual reasoning and highlights urgent needs for robust multimodal safety mechanisms.
📝 Abstract
Large Vision-Language Models (LVLMs) have become powerful and are widely adopted in practical applications. However, despite safety alignment, recent research shows that their visual modality remains vulnerable to multimodal jailbreak attacks, in which the model is induced to generate harmful content, leading to safety risks. In our work, we discover that flowcharts containing partially harmful information can induce LVLMs to provide additional harmful details. Based on this, we propose FC-Attack, a jailbreak attack method based on auto-generated flowcharts. Specifically, FC-Attack first fine-tunes a pre-trained LLM on benign datasets to create a step-description generator. The generator then produces step descriptions corresponding to a harmful query, which are rendered as flowcharts in three different shapes (vertical, horizontal, and S-shaped) and used as visual prompts. These flowcharts are combined with a benign textual prompt to execute a jailbreak attack on LVLMs. Our evaluation on the AdvBench dataset shows that FC-Attack achieves over 90% attack success rates on Gemini-1.5, LLaVA-Next, Qwen2-VL, and InternVL-2.5, outperforming existing LVLM jailbreak methods. We additionally investigate factors affecting attack performance, including the number of steps and the font styles in the flowcharts: changing the font style alone raises FC-Attack's success rate on Claude-3.5 from 4% to 28%. To mitigate the attack, we explore several defenses and find that AdaShield can largely reduce jailbreak performance, but at the cost of a utility drop.
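To make the three flowchart topologies concrete, here is a minimal, hypothetical sketch of the layout step. The paper renders step descriptions as flowchart images; this sketch only illustrates the vertical, horizontal, and S-shaped arrangements using ASCII boxes, with function names and the `width` parameter chosen for illustration (they are not from the paper).

```python
# Hypothetical layout sketch: arrange step descriptions into the three
# flowchart topologies named in the abstract (vertical, horizontal, S-shaped).
# ASCII stands in for the actual image rendering used by FC-Attack.

def box(text):
    """Wrap one step description in an ASCII box (three lines)."""
    border = "+" + "-" * (len(text) + 2) + "+"
    return [border, f"| {text} |", border]

def vertical(steps):
    """Stack steps top-to-bottom, joined by downward arrows."""
    lines = []
    for i, step in enumerate(steps):
        lines.extend(box(step))
        if i < len(steps) - 1:
            lines.extend(["    |", "    v"])
    return "\n".join(lines)

def horizontal(steps):
    """Lay steps left-to-right, joined by '-->' arrows."""
    return " --> ".join(f"[{s}]" for s in steps)

def s_shaped(steps, width=3):
    """Rows of `width` steps; every other row reverses direction,
    producing the serpentine (S-shaped) reading order."""
    rows = [steps[i:i + width] for i in range(0, len(steps), width)]
    lines = []
    for r, row in enumerate(rows):
        if r % 2 == 1:  # odd rows flow right-to-left
            row = row[::-1]
        lines.append(" --> ".join(f"[{s}]" for s in row))
    return "\n".join(lines)

if __name__ == "__main__":
    demo = [f"Step {i}" for i in range(1, 6)]
    print(vertical(demo[:2]))
    print(horizontal(demo))
    print(s_shaped(demo))
```

In the actual attack, each layout would be drawn as an image (where font style also matters, per the Claude-3.5 result) and paired with a benign textual prompt.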