Self-interpreting Adversarial Images

📅 2024-07-12
📈 Citations: 2
Influential: 0
🤖 AI Summary
Vision-language models (VLMs) are vulnerable to indirect adversarial attacks that manipulate how an image is interpreted without perceptibly altering its visual appearance. Method: We propose “self-interpreting adversarial images”—an attack that injects image-level soft prompts to implicitly embed meta-instructions (e.g., political bias or disinformation cues), including objectives that cannot be expressed as explicit text instructions, into cross-modal representations, steering model outputs toward an adversary-chosen style, sentiment, or stance while preserving visual fidelity. Contribution/Results: This is the first work to perform end-to-end differentiable soft-prompt injection directly in pixel space. We design a unified white-box and black-box attack framework compatible with mainstream VLMs, including LLaVA and Qwen-VL. Across multiple models and tasks, the method achieves an attack success rate above 85%; the generated adversarial images exhibit no perceptible artifacts, and model responses remain fluent and semantically coherent, yet are reliably steered by the injected meta-intent. These findings expose a previously unrecognized vulnerability in VLM cross-modal alignment.
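
For intuition, the white-box variant described above amounts to projected-gradient optimization over pixels, where the loss measures how well the model's answers to ordinary questions follow the meta-instruction. Below is a minimal sketch of that idea; the `answer_loss` callable, the question/answer pairs, and the hyperparameters are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical white-box sketch: optimize a small, L-infinity-bounded pixel
# perturbation so that a VLM's answers follow an adversary-chosen
# meta-instruction (e.g., a positive spin) while the image stays visually unchanged.
import torch

def craft_meta_instruction_image(image, qa_pairs, answer_loss,
                                 epsilon=8 / 255, alpha=1 / 255, steps=500):
    """PGD-style crafting of an image that acts as a soft prompt.

    image       : float tensor in [0, 1], shape (C, H, W)
    qa_pairs    : list of (question, target_answer) pairs, where each target
                  answer is written in the adversary-chosen style
    answer_loss : callable(perturbed_image, question, target_answer) -> scalar
                  tensor, e.g. teacher-forced cross-entropy of the target
                  answer under the victim VLM (model-specific, supplied by
                  the caller)
    """
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        # Sum the loss over many (question, answer) pairs so the perturbation
        # generalizes across user prompts instead of overfitting to one query.
        loss = sum(answer_loss(image + delta, q, a) for q, a in qa_pairs)
        loss.backward()
        with torch.no_grad():
            # Signed-gradient descent step on the perturbation.
            delta -= alpha * delta.grad.sign()
            # Project back into the epsilon-ball and the valid pixel range.
            delta.clamp_(-epsilon, epsilon)
            delta.copy_((image + delta).clamp(0, 1) - image)
        delta.grad.zero_()
    return (image + delta).detach()
```

Summing the loss over a pool of question/answer pairs is what lets the perturbation behave like a soft prompt: the steering effect carries over to user questions that were never seen during optimization.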

📝 Abstract
We introduce a new type of indirect, cross-modal injection attacks against visual language models that enable creation of self-interpreting images. These images contain hidden "meta-instructions" that control how models answer users' questions about the image and steer their outputs to express an adversary-chosen style, sentiment, or point of view. Self-interpreting images act as soft prompts, conditioning the model to satisfy the adversary's (meta-)objective while still producing answers based on the image's visual content. Meta-instructions are thus a stronger form of prompt injection. Adversarial images look natural and the model's answers are coherent and plausible--yet they also follow the adversary-chosen interpretation, e.g., political spin, or even objectives that are not achievable with explicit text instructions. We evaluate the efficacy of self-interpreting images for a variety of models, interpretations, and user prompts. We describe how these attacks could cause harm by enabling creation of self-interpreting content that carries spam, misinformation, or spin. Finally, we discuss defenses.
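
As a rough illustration of how the steering effect could be quantified, one might ask the model ordinary questions about the adversarial image and check whether its (otherwise coherent) answers match the adversary-chosen interpretation. The helpers below (`ask_vlm`, `sentiment_of`) are hypothetical placeholders supplied by the evaluator, not APIs from the paper.

```python
# Hypothetical evaluation sketch: fraction of answers that follow the injected
# meta-instruction (here, an adversary-chosen sentiment).
def attack_success_rate(adv_image, questions, ask_vlm, sentiment_of,
                        target_sentiment="positive"):
    hits = 0
    for question in questions:
        answer = ask_vlm(adv_image, question)          # image-grounded answer
        if sentiment_of(answer) == target_sentiment:   # does it carry the spin?
            hits += 1
    return hits / len(questions)
```
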
Problem

Research questions and friction points this paper is trying to address.

Adversarial Attacks
Image Processing
Bias Mitigation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adversarial Attacks
Image Captioning Models
Stealthy Manipulation