FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts

πŸ“… 2023-11-09
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 66
✨ Influential: 23
πŸ“„ PDF
πŸ€– AI Summary
This work exposes a critical vulnerability in large vision-language models (LVLMs): their visual embeddings lack the text-level safety constraints applied during alignment, so alignment mechanisms can be evaded with typographic images alone. To exploit this flaw, we propose FigStep, a black-box jailbreak attack tailored to LVLMs that requires no harmful textual instruction; instead, it converts the prohibited content into semantically preserved typographic images (i.e., styled, layout-aware renderings of text) that reliably elicit harmful outputs. Our key contributions are identifying this safety-alignment gap in visual embeddings and establishing a cross-modal safety evaluation framework for LVLMs. Evaluated on six mainstream open-source LVLMs, FigStep achieves an average attack success rate of 82.50%, substantially outperforming existing text-only and image-based attacks. This work reveals a fundamental deficiency in current LVLM alignment strategies and provides both a critical warning and a technical benchmark for developing robust multimodal safety mechanisms.
πŸ“ Abstract
Large Vision-Language Models (LVLMs) signify a groundbreaking paradigm shift within the Artificial Intelligence (AI) community, extending beyond the capabilities of Large Language Models (LLMs) by assimilating additional modalities (e.g., images). Despite this advancement, the safety of LVLMs remains underexplored, with a potential overreliance on the safety assurances purported by their underlying LLMs. In this paper, we propose FigStep, a straightforward yet effective black-box jailbreak algorithm against LVLMs. Instead of feeding textual harmful instructions directly, FigStep converts the prohibited content into images through typography to bypass the safety alignment. The experimental results indicate that FigStep can achieve an average attack success rate of 82.50% on six promising open-source LVLMs. Beyond demonstrating the efficacy of FigStep, we conduct comprehensive ablation studies and analyze the distribution of semantic embeddings, showing that the success of FigStep stems from a deficiency of safety alignment for visual embeddings. Moreover, we compare FigStep with five text-only jailbreaks and four image-based jailbreaks to demonstrate its superiority, i.e., negligible attack cost and better attack performance. Above all, our work reveals that current LVLMs are vulnerable to jailbreak attacks, which highlights the necessity of novel cross-modality safety alignment techniques. Our code and datasets are available at https://github.com/ThuCCSLab/FigStep .
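To make the typography step concrete, below is a minimal sketch of how prohibited text could be rendered into an image with Pillow, so that no harmful text reaches the model's text channel. The helper name, font path, canvas size, and the benign placeholder string are illustrative assumptions, not values taken from the paper; the authors' released code at the repository above is the authoritative implementation.

```python
# Sketch of the "convert text into a typographic image" idea from the abstract.
# Assumes Pillow is installed; font path and layout values are placeholders.
from PIL import Image, ImageDraw, ImageFont

def text_to_typographic_image(text: str,
                              width: int = 760,
                              height: int = 760,
                              font_size: int = 40,
                              font_path: str = "DejaVuSans.ttf") -> Image.Image:
    """Render `text` onto a white canvas, one rendered line per input line."""
    image = Image.new("RGB", (width, height), color="white")
    draw = ImageDraw.Draw(image)
    try:
        font = ImageFont.truetype(font_path, font_size)
    except OSError:
        font = ImageFont.load_default()  # fall back if the font file is missing
    # Simple fixed line spacing; the paper's code may wrap and style text differently.
    for i, line in enumerate(text.split("\n")):
        draw.text((20, 20 + i * int(font_size * 1.5)), line, fill="black", font=font)
    return image

# Usage: a benign placeholder statement with empty list slots; the accompanying
# harmless text prompt would ask the model to complete the numbered list.
img = text_to_typographic_image("Steps to make a paper airplane.\n1.\n2.\n3.")
img.save("typographic_prompt.png")
```

The resulting image is then paired with an innocuous textual prompt, which is how the attack keeps harmful instructions out of the text modality that the underlying LLM's safety alignment covers.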
Problem

Research questions and friction points this paper is trying to address.

Large Vision-Language Models
Image Security
Jailbreak Vulnerabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

FigStep
Adversarial Images
Cross-modal Security
πŸ”Ž Similar Papers
No similar papers found.
Yichen Gong
Department of Computer Science and Technology, Tsinghua University
Delong Ran
Institute for Network Sciences and Cyberspace, Tsinghua University
Jinyuan Liu
Institute for Advanced Study, BNRist, Tsinghua University
Conglei Wang
Carnegie Mellon University
Tianshuo Cong
Shuimu Tsinghua Scholar (Postdoctoral), Tsinghua University
Cryptography, Deep Learning, Computer Security
Anyu Wang
Institute for Advanced Study, Tsinghua University
Coding Theory, Cryptography
Sisi Duan
Institute for Advanced Study, BNRist, Tsinghua University; Zhongguancun Laboratory; National Financial Cryptography Research Center; Shandong Institute of Blockchain
Xiaoyun Wang
Institute for Advanced Study, BNRist, Tsinghua University; Zhongguancun Laboratory; National Financial Cryptography Research Center; Shandong Institute of Blockchain; School of Cyber Science and Technology, Shandong University