🤖 AI Summary
Vision-language models (VLMs) are vulnerable to implicit adversarial attacks that manipulate semantic interpretation without altering visual appearance.

Method: We propose "self-explaining adversarial images", a novel attack that injects image-level soft prompts to implicitly embed non-textualizable meta-instructions (e.g., political bias, disinformation cues) into cross-modal representations, thereby coercing model outputs toward predefined styles, sentiments, or stances while preserving visual fidelity.

Contribution/Results: This is the first work to enable end-to-end differentiable soft prompt injection directly in pixel space. We design a unified white-box and black-box attack framework compatible with mainstream VLMs including LLaVA and Qwen-VL. Across multiple models and tasks, our method achieves >85% attack success rate; generated adversarial images exhibit no perceptible artifacts, and model responses remain fluent and semantically coherent, yet are reliably steered by the injected meta-intent. Our findings expose a previously unrecognized vulnerability in VLM cross-modal alignment.
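The paper does not publish its optimization code here, but a pixel-space soft-prompt injection of this kind is typically realized as an L∞-bounded projected gradient descent over the image, minimizing a meta-objective loss while keeping the perturbation imperceptible. The sketch below is a minimal, hypothetical illustration: `grad_fn` stands in for the gradient of the (unspecified) VLM meta-objective loss with respect to pixels, and all names and hyperparameters are assumptions, not the authors' implementation.

```python
import numpy as np

def pgd_soft_prompt(image, grad_fn, eps=8 / 255, step=1 / 255, iters=100):
    """L_inf-bounded PGD sketch (hypothetical): perturb `image` to
    minimize a meta-objective loss while staying visually close to
    the original image."""
    adv = image.copy()
    for _ in range(iters):
        g = grad_fn(adv)                # gradient of the meta-objective loss w.r.t. pixels
        adv = adv - step * np.sign(g)   # signed gradient descent step
        adv = np.clip(adv, image - eps, image + eps)  # project into the eps-ball
        adv = np.clip(adv, 0.0, 1.0)    # keep pixels in a valid range
    return adv

# Toy stand-in for the VLM loss gradient: pulls pixels toward a target pattern.
target = np.full((4, 4), 0.5)
toy_grad = lambda x: 2.0 * (x - target)

img = np.zeros((4, 4))
adv = pgd_soft_prompt(img, toy_grad)
print(np.abs(adv - img).max())  # perturbation never exceeds eps
```

In the white-box setting described in the summary, `grad_fn` would backpropagate a loss on the model's generated text (e.g., likelihood of the adversary-chosen style) through the VLM's vision encoder to the pixels; the black-box variant would have to estimate this gradient from queries.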
📝 Abstract
We introduce a new type of indirect, cross-modal injection attack against visual language models that enables the creation of self-interpreting images. These images contain hidden "meta-instructions" that control how models answer users' questions about the image and steer their outputs to express an adversary-chosen style, sentiment, or point of view. Self-interpreting images act as soft prompts, conditioning the model to satisfy the adversary's (meta-)objective while still producing answers based on the image's visual content. Meta-instructions are thus a stronger form of prompt injection. Adversarial images look natural and the model's answers are coherent and plausible, yet they also follow the adversary-chosen interpretation, e.g., political spin, or even objectives that are not achievable with explicit text instructions. We evaluate the efficacy of self-interpreting images for a variety of models, interpretations, and user prompts. We describe how these attacks could cause harm by enabling the creation of self-interpreting content that carries spam, misinformation, or spin. Finally, we discuss defenses.