🤖 AI Summary
Vision-language models (VLMs) are vulnerable to implicit adversarial attacks that manipulate semantic interpretation without altering visual appearance.

Method: We propose "self-explaining adversarial images", a novel attack that injects image-level soft prompts to implicitly embed non-textualizable meta-instructions (e.g., political bias, disinformation cues) into cross-modal representations, thereby coercing model outputs toward predefined styles, sentiments, or stances while preserving visual fidelity.

Contribution/Results: This is the first work to enable end-to-end differentiable soft prompt injection directly in pixel space. We design a unified white-box and black-box attack framework compatible with mainstream VLMs including LLaVA and Qwen-VL. Across multiple models and tasks, our method achieves >85% attack success rate; generated adversarial images exhibit no perceptible artifacts, and model responses remain fluent and semantically coherent, yet are reliably steered by the injected meta-intent. Our findings expose a previously unrecognized vulnerability in VLM cross-modal alignment.
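The paper does not publish its optimization code here, but a pixel-space soft-prompt injection of this kind is typically realized as an L∞-bounded projected gradient descent over the image, minimizing a meta-objective loss while keeping the perturbation imperceptible. The sketch below is a minimal, hypothetical illustration: `grad_fn` stands in for the gradient of the (unspecified) VLM meta-objective loss with respect to pixels, and all names and hyperparameters are assumptions, not the authors' implementation.

```python
import numpy as np

def pgd_soft_prompt(image, grad_fn, eps=8 / 255, step=1 / 255, iters=100):
    """L_inf-bounded PGD sketch (hypothetical): perturb `image` to
    minimize a meta-objective loss while staying visually close to
    the original image."""
    adv = image.copy()
    for _ in range(iters):
        g = grad_fn(adv)                # gradient of the meta-objective loss w.r.t. pixels
        adv = adv - step * np.sign(g)   # signed gradient descent step
        adv = np.clip(adv, image - eps, image + eps)  # project into the eps-ball
        adv = np.clip(adv, 0.0, 1.0)    # keep pixels in a valid range
    return adv

# Toy stand-in for the VLM loss gradient: pulls pixels toward a target pattern.
target = np.full((4, 4), 0.5)
toy_grad = lambda x: 2.0 * (x - target)

img = np.zeros((4, 4))
adv = pgd_soft_prompt(img, toy_grad)
print(np.abs(adv - img).max())  # perturbation never exceeds eps
```

In the white-box setting described in the summary, `grad_fn` would backpropagate a loss on the model's generated text (e.g., likelihood of the adversary-chosen style) through the VLM's vision encoder to the pixels; the black-box variant would have to estimate this gradient from queries.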
📝 Abstract
We introduce a new type of indirect, cross-modal injection attack against visual language models that enables the creation of self-interpreting images. These images contain hidden "meta-instructions" that control how models answer users' questions about the image and steer their outputs to express an adversary-chosen style, sentiment, or point of view. Self-interpreting images act as soft prompts, conditioning the model to satisfy the adversary's (meta-)objective while still producing answers based on the image's visual content. Meta-instructions are thus a stronger form of prompt injection. Adversarial images look natural and the model's answers are coherent and plausible, yet they also follow the adversary-chosen interpretation, e.g., political spin, or even objectives that are not achievable with explicit text instructions. We evaluate the efficacy of self-interpreting images for a variety of models, interpretations, and user prompts. We describe how these attacks could cause harm by enabling the creation of self-interpreting content that carries spam, misinformation, or spin. Finally, we discuss defenses.