Jailbreaking Vision-Language Models Through the Visual Modality

📅 2026-05-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses a vulnerability in current vision-language models (VLMs): safety alignment is markedly weaker in the visual modality, leaving models susceptible to harmful intent conveyed through images. The work shows that the visual channel is an underexplored attack surface, because text-based safety training does not automatically generalize to visually communicated content, which exposes a cross-modality alignment gap. The authors introduce four jailbreak attacks: visual symbol encoding with a decoding legend, object substitution, replacement of harmful text in images, and visual analogy puzzles. They also present preliminary interpretability and mitigation results. Experiments across six frontier VLMs show that the attacks bypass safety alignment; for example, on Claude-Haiku-4.5 the visual cipher achieves a 40.9% attack success rate, versus 10.7% for an equivalent textual cipher.

📝 Abstract

The visual modality of vision-language models (VLMs) is an underexplored attack surface for bypassing safety alignment. We introduce four jailbreak attacks exploiting the vision component: (1) encoding harmful instructions as visual symbol sequences with a decoding legend, (2) replacing harmful objects with benign substitutes (e.g., bomb -> banana) then prompting for harmful actions using the substitute term, (3) replacing harmful text in images (e.g., on book covers) with benign words while visual context preserves the original meaning, and (4) visual analogy puzzles whose solution requires inferring a prohibited concept. Evaluating across six frontier VLMs, our visual attacks bypass safety alignment and expose a cross-modality alignment gap: text-based safety training does not automatically generalize to harmful intent conveyed visually. For example, our visual cipher achieves 40.9% attack success on Claude-Haiku-4.5 versus 10.7% for an equivalent textual cipher. To further our insight into the attack mechanism, we present preliminary interpretability and mitigation results. These findings highlight that robust VLM alignment requires treating vision as a first-class target for safety post-training.
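
The headline comparison can be restated as simple arithmetic. The sketch below is a minimal illustration and not the authors' evaluation code: it assumes attack success rate is the share of attempts that a judge marks as successful (this excerpt does not define the metric), and the function names `attack_success_rate` and `modality_gap` are hypothetical. Only the two percentages come from the abstract.

```python
def attack_success_rate(outcomes):
    """Percentage of attempts judged successful.

    `outcomes` is a sequence of booleans, one per attempt (assumed format).
    """
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    return 100.0 * sum(1 for ok in outcomes if ok) / len(outcomes)


def modality_gap(visual_asr, textual_asr):
    """Return (gap in percentage points, visual-to-textual ratio)."""
    return visual_asr - textual_asr, visual_asr / textual_asr


# Figures reported in the abstract for Claude-Haiku-4.5.
gap, ratio = modality_gap(40.9, 10.7)
print(f"gap = {gap:.1f} pp, ratio = {ratio:.2f}x")
# prints "gap = 30.2 pp, ratio = 3.82x"
```

Read this way, the visual cipher succeeds roughly 3.8 times as often as the textual one on that model, a gap of about 30 percentage points.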

Problem

Research questions and friction points this paper is trying to address.

vision-language models

jailbreak attacks

visual modality

safety alignment

cross-modality alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

visual jailbreak

vision-language models

cross-modality alignment