VisualDAN: Exposing Vulnerabilities in VLMs with Visual-Driven DAN Commands

📅 2025-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work exposes critical security vulnerabilities in vision-language models (VLMs) under visually grounded “Do Anything Now” (DAN) instructions. To exploit these flaws, the authors propose VisualDAN—the first adversarial attack that adapts text-domain DAN jailbreaking to the visual modality by generating a single adversarial image embedding that encodes a DAN prompt, thereby hijacking the model’s visual encoder, bypassing safety alignment mechanisms, and eliciting harmful outputs. The method combines adversarial example training with affirmative-prefix augmentation to transfer malicious semantics across modalities, from text to image. Evaluated on mainstream VLMs—including MiniGPT-4 and LLaVA—VisualDAN achieves scalable generation of inappropriate responses using only a small set of toxic examples. This study constitutes the first systematic demonstration of severe multimodal security risks posed by image-driven attacks.

📝 Abstract
Vision-Language Models (VLMs) have garnered significant attention for their remarkable ability to interpret and generate multimodal content. However, securing these models against jailbreak attacks continues to be a substantial challenge. Unlike text-only models, VLMs integrate additional modalities, introducing novel vulnerabilities such as image hijacking, which can manipulate the model into producing inappropriate or harmful responses. Drawing inspiration from text-based jailbreaks like the "Do Anything Now" (DAN) command, this work introduces VisualDAN, a single adversarial image embedded with DAN-style commands. Specifically, we prepend harmful corpora with affirmative prefixes (e.g., "Sure, I can provide the guidance you need") to trick the model into responding positively to malicious queries. The adversarial image is then trained on these DAN-inspired harmful texts and transformed into the text domain to elicit malicious outputs. Extensive experiments on models such as MiniGPT-4, MiniGPT-v2, InstructBLIP, and LLaVA reveal that VisualDAN effectively bypasses the safeguards of aligned VLMs, forcing them to execute a broad range of harmful instructions that severely violate ethical standards. Our results further demonstrate that even a small amount of toxic content can significantly amplify harmful outputs once the model's defenses are compromised. These findings highlight the urgent need for robust defenses against image-based attacks and offer critical insights for future research into the alignment and security of VLMs.
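The affirmative-prefix trick described in the abstract can be sketched in a few lines. The prefix string below is the example quoted in the abstract; the corpus entries are placeholders, not the paper's data.

```python
# Sketch of the affirmative-prefix augmentation described in the abstract.
# The prefix is the example quoted there; the corpus entries are placeholders.

AFFIRMATIVE_PREFIX = "Sure, I can provide the guidance you need. "

def augment_with_prefix(harmful_texts):
    """Prepend an affirmative prefix so each optimization target
    opens with a compliant-sounding response."""
    return [AFFIRMATIVE_PREFIX + text for text in harmful_texts]

corpus = ["<harmful instruction 1>", "<harmful instruction 2>"]
targets = augment_with_prefix(corpus)
```

The adversarial image is then optimized against these prefixed targets, so that a compromised model begins its reply in an affirmative register rather than with a refusal.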
Problem

Research questions and friction points this paper is trying to address.

Exposing vulnerabilities in Vision-Language Models through visual-driven attacks
Bypassing aligned VLMs' safeguards using adversarial images with DAN commands
Investigating how image hijacking manipulates models into harmful responses
Innovation

Methods, ideas, or system contributions that make the work stand out.

VisualDAN embeds DAN commands into adversarial images
Trains images on harmful texts to bypass safeguards
Uses affirmative prefixes to trick models into compliance
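The training loop behind these bullets can be illustrated with a toy projected-gradient sketch. The linear "model" `W` stands in for a frozen VLM's visual encoder plus language head, and all names, shapes, seeds, and step sizes here are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

# Toy sketch: optimize a single adversarial "image" so a frozen surrogate
# model assigns high probability to an affirmative target token.
# W stands in for a real VLM's visual encoder + language head; all shapes,
# seeds, and step sizes are illustrative assumptions, not the paper's setup.

rng = np.random.default_rng(0)
VOCAB, PIXELS = 8, 16
W = rng.normal(size=(VOCAB, PIXELS))   # frozen surrogate weights
TARGET = 3                             # toy index of the "Sure" token

def loss_and_grad(x):
    """Cross-entropy toward the target token, and its gradient w.r.t. pixels."""
    logits = W @ x
    p = np.exp(logits - logits.max())  # numerically stable softmax
    p /= p.sum()
    loss = -np.log(p[TARGET])
    grad = W.T @ (p - np.eye(VOCAB)[TARGET])
    return loss, grad

x = np.clip(rng.normal(size=PIXELS), 0.0, 1.0)  # start from a benign "image"
losses = []
for _ in range(500):                   # projected gradient descent
    loss, grad = loss_and_grad(x)
    losses.append(loss)
    x = np.clip(x - 0.01 * grad, 0.0, 1.0)      # project back into pixel range
```

Against a real VLM, the analogous loop would backpropagate the language-modeling loss on the prefix-augmented harmful targets through the frozen visual encoder, updating only the image pixels.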