Typographic Attacks in a Multi-Image Setting

📅 2025-02-12
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the vulnerability of large vision-language models (LVLMs) to typographic attacks, in which visible attack text added to an image causes misclassification, under multi-image input settings. We propose, for the first time, a **multi-image non-repeating typographic attack paradigm**, in which attack texts are injected across an image set without any text being reused, making the attack harder for a gatekeeper to detect. Methodologically, we design two attack strategies guided by target-image difficulty, attack-text strength, and CLIP-based text–image similarity, enabling adaptive selection of an attack text for each image. Experiments on ImageNet demonstrate a 21% improvement in attack success rate against CLIP over random, non-specific baselines, and the CLIP-derived similarity transfers to attacking InstructBLIP. Our approach moves beyond single-image attacks, enhancing both the stealthiness and the practical threat of typographic attacks in multi-image vision-language understanding.

πŸ“ Abstract
Large Vision-Language Models (LVLMs) are susceptible to typographic attacks, which are misclassifications caused by an attack text that is added to an image. In this paper, we introduce a multi-image setting for studying typographic attacks, broadening the current emphasis of the literature on attacking individual images. Specifically, our focus is on attacking image sets without repeating the attack query. Such non-repeating attacks are stealthier, as they are more likely to evade a gatekeeper than attacks that repeat the same attack text. We introduce two attack strategies for the multi-image setting, leveraging the difficulty of the target image, the strength of the attack text, and text-image similarity. Our text-image similarity approach improves attack success rates by 21% over random, non-specific methods on the CLIP model using ImageNet while maintaining stealth in a multi-image scenario. An additional experiment demonstrates transferability, i.e., text-image similarity calculated using CLIP transfers when attacking InstructBLIP.
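The non-repeating, similarity-guided assignment described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes image and attack-text embeddings have already been computed with a CLIP-style encoder (random vectors stand in for them here), and it greedily assigns each image the most similar still-unused attack text so that no text is repeated across the set.

```python
import numpy as np

def assign_attack_texts(image_embs, text_embs):
    """Greedily assign each image a distinct attack text, preferring the
    text whose embedding has the highest cosine similarity to the image
    (higher text-image similarity is assumed to yield a stronger attack).
    No text is reused, so the attack stays non-repeating."""
    # Normalize rows so dot products are cosine similarities.
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = img @ txt.T  # shape: (n_images, n_texts)

    assignment = {}
    used = set()
    # Visit images in order of their best available similarity, highest first.
    for i in np.argsort(-sims.max(axis=1)):
        for j in np.argsort(-sims[i]):
            if j not in used:
                assignment[int(i)] = int(j)
                used.add(j)
                break
    return assignment

# Demo with random stand-in embeddings; in practice these would come
# from CLIP's image and text encoders.
rng = np.random.default_rng(0)
image_embs = rng.normal(size=(4, 8))   # 4 images
text_embs = rng.normal(size=(6, 8))    # 6 candidate attack texts
print(assign_attack_texts(image_embs, text_embs))
```

The greedy order (hardest-to-match images could equally be prioritized, per the paper's difficulty-based strategy) is one of several reasonable heuristics; the key constraint is simply that each attack text is used at most once.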
Problem

Research questions and friction points this paper is trying to address.

Study typographic attacks in multi-image settings.
Develop stealthier non-repeating attack strategies.
Improve attack success using text-image similarity.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-image typographic attack strategies
Text-image similarity enhances attack success
Transferable attack techniques across models