VisInject: Disruption != Injection -- A Dual-Dimension Evaluation of Universal Adversarial Attacks on Vision-Language Models

📅 2026-05-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a flaw in existing evaluations of universal adversarial attacks on vision-language models: reported attack success rates conflate any perturbation of the model's output with precise injection of the attacker's target concept, which inflates estimates of attack efficacy. The authors propose a two-dimensional evaluation framework that separates general “Influence” from true “Precise Injection.” It combines a programmatic Ratcliff-Obershelp drift score with a four-tier injection grade (none/weak/partial/confirmed) assigned by an LLM judge. Experiments on 6,615 response pairs under an L∞ budget of 16/255 show that 66.4% of pairs exhibit output perturbation, but only 0.756% reach any non-none injection tier and only 0.030% are verbatim injections. BLIP-2 shows zero detectable drift at this perturbation budget across all 2,205 of its pairs. The authors release the full dataset and the judge's input cache to support reproducible research.
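The headline percentages can be re-derived from the raw counts given in the abstract (50 and 2 out of 6,615 pairs). The snippet below is only an arithmetic check; the variable names are illustrative and do not come from the released code.

```python
# Arithmetic check of the reported rates (counts taken from the abstract).
total_pairs = 6615          # response pairs evaluated
any_injection = 50          # pairs reaching any non-none injection tier
verbatim_injection = 2      # pairs where the target was emitted verbatim
disturbed_pct = 66.4        # reported share of programmatically disturbed pairs

any_injection_pct = 100 * any_injection / total_pairs
verbatim_pct = 100 * verbatim_injection / total_pairs

# Ratio between the Influence rate and the Precise Injection rate.
axis_gap = disturbed_pct / any_injection_pct

print(f"any injection: {any_injection_pct:.3f}%")
print(f"verbatim:      {verbatim_pct:.3f}%")
print(f"gap:           {axis_gap:.1f}x")
```

The ratio comes out a little under 88, which matches the abstract's "roughly 90×" divergence between the two axes.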

📝 Abstract

Universal adversarial attacks on aligned multimodal large language models are increasingly reported with attack success rates in the 60-80% range, suggesting that the visual modality is highly vulnerable to imperceptible perturbations as a prompt-injection channel. We argue that this number conflates two distinct events: (i) the model's output was perturbed (Influence), and (ii) the attacker's chosen target concept was actually emitted (Precise Injection). We compose two existing techniques, Universal Adversarial Attack and AnyAttack, under an $L_\infty$ budget of 16/255, and we add a dual-axis evaluation: a deterministic Ratcliff-Obershelp drift score for Influence (the programmatic baseline) plus a 4-tier ordinal scale (none/weak/partial/confirmed) for Precise Injection. The judge is DeepSeek-V4-Pro in thinking mode, calibrated against Claude Opus 4.7 with Cohen's $\kappa$ = 0.77 on the injection axis (substantial agreement); the entire 4,475-entry SHA-256 input cache ships with the dataset so reviewers can re-derive the paper's numbers bit-exactly without an API key. Across 6,615 pairs over four open VLMs, seven attack prompts, and seven test images, the two axes diverge by roughly 90$\times$: 66.4% of pairs are programmatically disturbed (46.6% are LLM-judged at the substantial-or-complete tier), but only 0.756% (50/6,615) reach any non-none injection tier and only 0.030% (2/6,615) are verbatim injections. The few injections that do land cluster on screenshot- or document-style carriers whose semantics already invite text transcription. BLIP-2 shows zero detectable drift at $L_\infty$ = 16/255 across all 2,205 pairs, even when used as a Stage-1 surrogate. We release the full dataset (21 universal images, 147 adversarial photos, 6,615 response pairs, the v3 dual-axis judge results, and the cache) at huggingface.co/datasets/jeffliulab/visinject.
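To make the Influence axis concrete, here is a minimal sketch of a deterministic drift score between a clean response and an adversarial response. It assumes Python's standard-library `difflib.SequenceMatcher`, whose matching is closely related to Ratcliff-Obershelp, and it assumes that drift is defined as one minus the similarity ratio. The function names and the `threshold` parameter are hypothetical; the paper's exact definition and cutoff are not given in the abstract.

```python
from difflib import SequenceMatcher


def drift_score(clean_response: str, adversarial_response: str) -> float:
    """Return 1 - similarity: 0.0 for identical strings, 1.0 for no overlap.

    Assumption: drift is the complement of the SequenceMatcher ratio.
    autojunk=False disables difflib's popularity heuristic so long
    responses are compared character by character.
    """
    matcher = SequenceMatcher(None, clean_response, adversarial_response,
                              autojunk=False)
    return 1.0 - matcher.ratio()


def is_disturbed(clean_response: str, adversarial_response: str,
                 threshold: float = 0.0) -> bool:
    """Flag a pair as disturbed when drift exceeds a (hypothetical) threshold."""
    return drift_score(clean_response, adversarial_response) > threshold


if __name__ == "__main__":
    clean = "A dog sitting on a wooden bench in a park."
    adversarial = "A dog sitting on a bench next to a sign."
    print(f"drift = {drift_score(clean, adversarial):.3f}")
    print("disturbed:", is_disturbed(clean, adversarial))
```

A score like this only measures how far the output moved, not where it moved to, which is why a separate judged axis is needed to decide whether the attacker's target concept actually appeared.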

Problem

Research questions and friction points this paper is trying to address.

adversarial attacks

vision-language models

prompt injection

universal perturbations

multimodal alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

dual-dimension evaluation

precise injection

universal adversarial attacks