Vision-LLMs Can Fool Themselves with Self-Generated Typographic Attacks

📅 2024-02-01
🏛️ arXiv.org
📈 Citations: 6
Influential: 0
🤖 AI Summary
This study exposes a critical vulnerability of Large Vision-Language Models (LVLMs) to self-generated textual interference in multimodal understanding, termed typographic attacks, where semantically misleading text embedded in an image deceives the language module and induces erroneous visual reasoning and classification. The authors propose two novel attack paradigms: (1) class-based attacks, which leverage category-level semantic similarity to identify effective perturbations, and (2) reasoning-driven attacks, which use multi-step prompting to exploit the LVLM's own generative capabilities to craft highly transferable adversarial examples. Evaluated on state-of-the-art LVLMs, including GPT-4V, InstructBLIP, and MiniGPT-4, the attacks reduce image classification accuracy by up to 60% and transfer well across models. These findings reveal a fundamental security blind spot in the end-to-end reasoning pipeline of LVLMs, where tight coupling between the vision and language modules amplifies susceptibility to textual adversarial manipulation.

📝 Abstract
Typographic attacks, which add misleading text to images, can deceive large vision-language models (LVLMs). The susceptibility of recent large LVLMs like GPT-4V to such attacks is understudied, raising concerns about amplified misinformation in personal assistant applications. Previous attacks use simple strategies, such as random misleading words, which do not fully exploit LVLMs' language reasoning abilities. We introduce an experimental setup for testing typographic attacks on LVLMs and propose two novel self-generated attacks: (1) Class-based attacks, where the model identifies a similar class to deceive itself, and (2) Reasoned attacks, where an advanced LVLM suggests an attack combining a deceiving class and description. Our experiments show these attacks significantly reduce classification performance by up to 60% and are effective across different models, including InstructBLIP and MiniGPT4. Code: https://github.com/mqraitem/Self-Gen-Typo-Attack
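The class-based attack described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it uses `difflib` string similarity as a hypothetical stand-in for the LVLM's own semantic judgment of class similarity, and `pick_deceiving_class` is an illustrative name. The chosen label would then be rendered as text onto the image to mislead the model.

```python
from difflib import SequenceMatcher


def pick_deceiving_class(true_class, candidate_classes):
    """Sketch of a class-based typographic attack: choose the candidate
    class most similar to the true class. In the paper this similarity
    judgment comes from the LVLM itself; here string overlap is used as
    a simple stand-in."""
    # The true class cannot be its own decoy.
    candidates = [c for c in candidate_classes if c != true_class]
    # Rank remaining candidates by surface similarity to the true class.
    return max(
        candidates,
        key=lambda c: SequenceMatcher(None, true_class, c).ratio(),
    )


labels = ["golden retriever", "labrador retriever", "tabby cat", "sports car"]
# Picks "labrador retriever" (largest string overlap with the true class).
print(pick_deceiving_class("golden retriever", labels))
```

A near-miss class like this is more deceptive than a random word because the language module's own priors make the confusion plausible.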
Problem

Research questions and friction points this paper is trying to address.

Explore typographic attacks on LVLMs
Test susceptibility of GPT-4V to attacks
Propose novel self-generated attack strategies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Class-based typographic attacks deceive models.
Reasoned attacks combine deceiving class and description.
Self-generated attacks reduce classification performance significantly.
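The reasoned attack relies on prompting an advanced LVLM to propose both the deceiving class and a supporting description. A minimal sketch of such a prompt builder is below; the wording and the function name `build_reasoned_attack_prompt` are illustrative assumptions, not the paper's exact template.

```python
def build_reasoned_attack_prompt(true_class, candidate_classes):
    """Sketch of the reasoned-attack setup: ask an LVLM to pick a
    deceiving class AND write a short caption arguing for it. The
    returned string would be sent to the attacking LVLM; its answer
    is then rendered as text onto the target image."""
    return (
        f"An image shows a '{true_class}'. From the classes "
        f"{candidate_classes}, pick the one most easily confused with it, "
        f"and write a one-sentence caption arguing the image shows that "
        f"class instead."
    )


prompt = build_reasoned_attack_prompt(
    "golden retriever", ["labrador retriever", "tabby cat"]
)
print(prompt)
```

Combining a plausible class with a free-form justification is what distinguishes reasoned attacks from the simpler class-based variant.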