FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts

πŸ“… 2023-11-09
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 66
✨ Influential: 23
πŸ“„ PDF
πŸ€– AI Summary
This work exposes a critical vulnerability in large vision-language models (LVLMs): their visual embeddings lack the text-level safety constraints applied during alignment, so alignment mechanisms can be evaded with typographic images alone. To exploit this flaw, we propose FigStep, a black-box jailbreak attack tailored to LVLMs that requires no harmful textual instruction; instead, it converts the prohibited content into semantically preserved typographic images (i.e., styled, layout-aware renderings of text) that reliably elicit harmful outputs. Our key contributions are identifying this safety-alignment gap in visual embeddings and establishing a cross-modal safety evaluation framework for LVLMs. Evaluated on six mainstream open-source LVLMs, FigStep achieves an average attack success rate of 82.50%, substantially outperforming existing text-only and image-based attacks. This work reveals a fundamental deficiency in current LVLM alignment strategies and provides both a critical warning and a technical benchmark for developing robust multimodal safety mechanisms.
πŸ“ Abstract
Large Vision-Language Models (LVLMs) signify a groundbreaking paradigm shift within the Artificial Intelligence (AI) community, extending beyond the capabilities of Large Language Models (LLMs) by assimilating additional modalities (e.g., images). Despite this advancement, the safety of LVLMs remains underexplored, with a potential overreliance on the safety assurances purported by their underlying LLMs. In this paper, we propose FigStep, a straightforward yet effective black-box jailbreak algorithm against LVLMs. Instead of feeding textual harmful instructions directly, FigStep converts the prohibited content into images through typography to bypass the safety alignment. The experimental results indicate that FigStep can achieve an average attack success rate of 82.50% on six promising open-source LVLMs. Beyond demonstrating the efficacy of FigStep, we conduct comprehensive ablation studies and analyze the distribution of semantic embeddings, showing that the success of FigStep stems from a deficiency of safety alignment for visual embeddings. Moreover, we compare FigStep with five text-only jailbreaks and four image-based jailbreaks to demonstrate its superiority, i.e., negligible attack cost and better attack performance. Above all, our work reveals that current LVLMs are vulnerable to jailbreak attacks, which highlights the necessity of novel cross-modality safety alignment techniques. Our code and datasets are available at https://github.com/ThuCCSLab/FigStep .
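To make the typography step concrete, below is a minimal sketch of how prohibited text could be rendered into an image with Pillow, so that no harmful text reaches the model's text channel. The helper name, font path, canvas size, and the benign placeholder string are illustrative assumptions, not values taken from the paper; the authors' released code at the repository above is the authoritative implementation.

```python
# Sketch of the "convert text into a typographic image" idea from the abstract.
# Assumes Pillow is installed; font path and layout values are placeholders.
from PIL import Image, ImageDraw, ImageFont

def text_to_typographic_image(text: str,
                              width: int = 760,
                              height: int = 760,
                              font_size: int = 40,
                              font_path: str = "DejaVuSans.ttf") -> Image.Image:
    """Render `text` onto a white canvas, one rendered line per input line."""
    image = Image.new("RGB", (width, height), color="white")
    draw = ImageDraw.Draw(image)
    try:
        font = ImageFont.truetype(font_path, font_size)
    except OSError:
        font = ImageFont.load_default()  # fall back if the font file is missing
    # Simple fixed line spacing; the paper's code may wrap and style text differently.
    for i, line in enumerate(text.split("\n")):
        draw.text((20, 20 + i * int(font_size * 1.5)), line, fill="black", font=font)
    return image

# Usage: a benign placeholder statement with empty list slots; the accompanying
# harmless text prompt would ask the model to complete the numbered list.
img = text_to_typographic_image("Steps to make a paper airplane.\n1.\n2.\n3.")
img.save("typographic_prompt.png")
```

The resulting image is then paired with an innocuous textual prompt, which is how the attack keeps harmful instructions out of the text modality that the underlying LLM's safety alignment covers.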
Problem

Research questions and friction points this paper is trying to address.

Large Vision-Language Models
Image Security
Jailbreak Vulnerabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

FigStep
Adversarial Images
Cross-modal Security
πŸ”Ž Similar Papers
No similar papers found.
Yichen Gong
Department of Computer Science and Technology, Tsinghua University
Delong Ran
Institute for Network Sciences and Cyberspace, Tsinghua University
Jinyuan Liu
Institute for Advanced Study, BNRist, Tsinghua University
Conglei Wang
Carnegie Mellon University
Tianshuo Cong
Shuimu Tsinghua Scholar (Postdoctoral), Tsinghua University
Cryptography, Deep Learning, Computer Security
Anyu Wang
Institute for Advanced Study, Tsinghua University
Coding Theory, Cryptography
Sisi Duan
Institute for Advanced Study, BNRist, Tsinghua University; Zhongguancun Laboratory; National Financial Cryptography Research Center; Shandong Institute of Blockchain
Xiaoyun Wang
Institute for Advanced Study, BNRist, Tsinghua University; Zhongguancun Laboratory; National Financial Cryptography Research Center; Shandong Institute of Blockchain; School of Cyber Science and Technology, Shandong University