Web Artifact Attacks Disrupt Vision Language Models

📅 2025-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies unintended semantic-visual associations in vision-language models (VLMs) induced by branding content mixed into large-scale web-sourced image-text data, causing models to rely on spurious statistical correlations rather than genuine understanding. Such associations enable "artifact attacks": adversarial manipulations that exploit mismatched textual descriptions and graphical artifacts (e.g., logos, icons), which are difficult to predefine and highly transferable. The authors formally define and empirically validate this generalized artifact-attack paradigm, extending beyond conventional typographic attacks. Methodologically, the attack is framed as a black-box search over candidate artifacts, while the defense extends artifact-aware prompting to graphical contexts. Evaluated across five benchmarks, the attack achieves up to a 100% success rate and up to 90% cross-model transferability; the defense reduces attack success rates by up to 15%, demonstrating a feasible robustness improvement.

📝 Abstract
Vision-language models (VLMs) (e.g., CLIP, LLaVA) are trained on large-scale, lightly curated web datasets, leading them to learn unintended correlations between semantic concepts and unrelated visual signals. These associations degrade model accuracy by causing predictions to rely on incidental patterns rather than genuine visual understanding. Prior work has weaponized these correlations as an attack vector to manipulate model predictions, such as inserting deceiving class text onto the image in a typographic attack. These attacks succeed due to VLMs' text-heavy bias, a result of captions that echo visible words rather than describing content. However, this attack has focused solely on text that matches the target class exactly, overlooking a broader range of correlations, including non-matching text and graphical symbols, which arise from the abundance of branding content in web-scale data. To address this gap, we introduce artifact-based attacks: a novel class of manipulations that mislead models using both non-matching text and graphical elements. Unlike typographic attacks, these artifacts are not predefined, making them harder to defend against but also more challenging to find. We address this by framing artifact attacks as a search problem and demonstrate their effectiveness across five datasets, with some artifacts reinforcing each other to reach 100% attack success rates. These attacks transfer across models with up to 90% effectiveness, making it possible to attack unseen models. To defend against these attacks, we extend prior work's artifact-aware prompting to the graphical setting. We see a moderate reduction of success rates of up to 15% relative to standard prompts, suggesting a promising direction for enhancing model robustness.
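The abstract frames artifact attacks as a search problem: try candidate artifacts (logos, icons, non-matching text) as overlays, query the victim model as a black box, and keep the artifacts that most raise the score of the attacker's target class. The sketch below is one plausible greedy instantiation of that idea, not the paper's actual algorithm; `score_target` is a stand-in for a real VLM query (e.g., a CLIP class probability), and the toy scorer just mimics the reinforcement effect the abstract reports.

```python
def greedy_artifact_search(image, candidates, score_target, max_artifacts=3):
    """Greedily select artifacts to overlay on `image` so that the black-box
    scorer `score_target(image, artifacts)` (the victim model's score for the
    attacker's target class) increases. Illustrative sketch only."""
    chosen = []
    best = score_target(image, chosen)
    for _ in range(max_artifacts):
        # Evaluate each remaining candidate added on top of the current set.
        gains = [(score_target(image, chosen + [a]), a)
                 for a in candidates if a not in chosen]
        if not gains:
            break
        score, art = max(gains)
        if score <= best:  # stop when no artifact helps further
            break
        chosen.append(art)
        best = score
    return chosen, best

# Toy stand-in scorer: two artifacts reinforce each other, as the paper
# reports some artifacts do. Values are invented for illustration.
def toy_scorer(image, artifacts):
    base = {"logo_A": 0.3, "icon_B": 0.2, "text_C": 0.0}
    s = 0.1 + sum(base.get(a, 0.0) for a in artifacts)
    if "logo_A" in artifacts and "icon_B" in artifacts:
        s += 0.4  # reinforcement effect between two artifacts
    return min(s, 1.0)

arts, score = greedy_artifact_search("img", ["logo_A", "icon_B", "text_C"], toy_scorer)
```

Because the search only needs model scores, not gradients, it applies to unseen models as well, which is consistent with the transferability the abstract describes.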
Problem

Research questions and friction points this paper is trying to address.

How do unintended correlations learned from web-scale data affect vision-language models?
Can attacks move beyond exact-match typographic text to non-matching text and graphical symbols?
How can models be defended against such artifact-based attacks?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces artifact-based attacks using non-matching text and graphical elements
Frames artifact attacks as a search problem
Extends artifact-aware prompting to graphical settings
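The defense listed above is artifact-aware prompting: the query sent to the VLM is prefixed with an instruction to disregard overlaid artifacts. The paper's exact prompt wording is not given here, so the template below is an illustrative guess at what such a prompt wrapper could look like.

```python
def artifact_aware_prompt(question):
    """Wrap a VQA-style question with an instruction to ignore overlaid
    artifacts (logos, icons, written text), in the spirit of artifact-aware
    prompting. The guard wording is illustrative, not the paper's prompt."""
    guard = ("The image may contain superimposed artifacts such as logos, "
             "icons, or written text that do not describe the scene. "
             "Ignore them and answer based on the visual content only.")
    return f"{guard}\n{question}"

prompt = artifact_aware_prompt("What object is shown in the image?")
```

The reported effect of this style of defense is moderate (success rates drop by up to 15%), so it complements rather than replaces data curation.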