🤖 AI Summary
This work exposes critical security vulnerabilities in vision-language models (VLMs) under visually grounded “Do Anything Now” (DAN) instructions. To exploit these flaws, we propose VisualDAN, the first adversarial attack to adapt text-domain DAN jailbreaking to the visual modality: it generates a single adversarial image whose embedding encodes a DAN prompt, thereby hijacking the model’s visual encoder, bypassing safety alignment mechanisms, and eliciting harmful outputs. The method combines adversarial-example training with affirmative-prefix augmentation to transfer malicious semantics across modalities, from text to image. Evaluated on mainstream VLMs, including MiniGPT-4 and LLaVA, VisualDAN elicits a broad range of inappropriate responses from only a small set of toxic training examples. This study is the first systematic demonstration of the severe multimodal security risks posed by image-driven attacks.
📝 Abstract
Vision-Language Models (VLMs) have garnered significant attention for their remarkable ability to interpret and generate multimodal content. However, securing these models against jailbreak attacks continues to be a substantial challenge. Unlike text-only models, VLMs integrate additional modalities, introducing novel vulnerabilities such as image hijacking, which can manipulate the model into producing inappropriate or harmful responses. Drawing inspiration from text-based jailbreaks like the "Do Anything Now" (DAN) command, this work introduces VisualDAN, a single adversarial image embedded with DAN-style commands. Specifically, we prepend harmful corpora with affirmative prefixes (e.g., "Sure, I can provide the guidance you need") to trick the model into responding positively to malicious queries. The adversarial image is then optimized on these DAN-inspired harmful texts so that, once mapped into the text domain, it elicits malicious outputs. Extensive experiments on models such as MiniGPT-4, MiniGPT-v2, InstructBLIP, and LLaVA reveal that VisualDAN effectively bypasses the safeguards of aligned VLMs, forcing them to execute a broad range of harmful instructions that severely violate ethical standards. Our results further demonstrate that even a small amount of toxic content can significantly amplify harmful outputs once the model's defenses are compromised. These findings highlight the urgent need for robust defenses against image-based attacks and offer critical insights for future research into the alignment and security of VLMs.
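The optimization described above can be sketched, in heavily simplified form, as gradient-based adjustment of image pixels to maximize the likelihood of an affirmative-prefixed target response. The toy linear "model", vocabulary size, sequence length, and step size below are all illustrative assumptions, not the paper's actual architecture or training recipe:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a frozen VLM: pixels -> per-token logits (shapes are hypothetical)
VOCAB, PIX, SEQ = 50, 64, 6
W = rng.normal(scale=0.1, size=(SEQ, VOCAB, PIX))   # frozen "model" weights
target = rng.integers(0, VOCAB, size=SEQ)           # token ids of an affirmative prefix

def loss_and_grad(x):
    """Cross-entropy of the target tokens plus its gradient w.r.t. the pixels."""
    total, grad = 0.0, np.zeros(PIX)
    for t in range(SEQ):
        logits = W[t] @ x
        p = np.exp(logits - logits.max())
        p /= p.sum()                                # softmax over the vocabulary
        total -= np.log(p[target[t]] + 1e-12)
        err = p.copy()
        err[target[t]] -= 1.0                       # d(CE)/d(logits) for softmax
        grad += W[t].T @ err
    return total, grad

x = rng.uniform(0, 1, PIX)                          # adversarial image, pixels in [0, 1]
losses = []
for _ in range(200):                                # PGD-style iterations
    L, g = loss_and_grad(x)
    losses.append(L)
    x = np.clip(x - 0.5 * g, 0.0, 1.0)              # step, then project to valid pixels

print(losses[0] > losses[-1])                       # loss on the target text decreases
```

In the real attack the frozen model is a full VLM and the loss is taken over many affirmative-prefixed harmful texts rather than one random token sequence, but the structure (frozen weights, trainable pixels, projection back to the valid image range) is the same.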