Exploring Typographic Visual Prompts Injection Threats in Cross-Modality Generation Models

📅 2025-03-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work identifies a cross-modal semantic interference threat, Typographic Visual Prompt Injection (TVPI), in which typographic text embedded in input images induces large vision-language models (LVLMs) and image-to-image generative models (I2I GMs) to produce disruptive outputs that are semantically related to the injected text. To systematically investigate this threat, we construct the first TVPI benchmark dataset and conduct comprehensive evaluations across 20+ state-of-the-art models. Our experiments empirically demonstrate, for the first time, that TVPI generalizes across architectures in both Vision-Language Perception (VLP) and I2I generation tasks. We further propose a principled TVPI threat taxonomy and an attribution analysis framework, identifying tight coupling between textual and visual features as the root cause. Collectively, these findings provide empirical evidence and an interpretable analytical paradigm for improving the safety and alignment of multimodal foundation models.

📝 Abstract

Current Cross-Modality Generation Models (GMs) demonstrate remarkable capabilities in various generative tasks. Given the ubiquity and information richness of vision modality inputs in real-world scenarios, cross-vision tasks, which encompass Vision-Language Perception (VLP) and Image-to-Image (I2I) tasks, have attracted significant attention. Large Vision Language Models (LVLMs) and I2I GMs are employed to handle VLP and I2I tasks, respectively. Previous research indicates that printing typographic words into input images significantly induces LVLMs and I2I GMs to generate disruptive outputs semantically related to those words. Additionally, visual prompts, a more sophisticated form of typography, have also been shown to pose security risks to various applications of VLP tasks when injected into images. In this paper, we comprehensively investigate the performance impact induced by Typographic Visual Prompt Injection (TVPI) in various LVLMs and I2I GMs. To better observe the performance changes and characteristics of this threat, we also introduce the TVPI Dataset. Through extensive explorations, we deepen the understanding of the underlying causes of the TVPI threat in various GMs and offer valuable insights into its potential origins.
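
As a toy illustration of what "printing typographic words into input images" means at the pixel level, the sketch below stamps a word into a small grayscale grid using only the Python standard library. It is not the paper's pipeline: the 3x5 bitmap font, the grid size, and the names `GLYPHS`, `blank_image`, and `stamp_text` are all hypothetical, and an actual study would presumably render text onto real images with an imaging library before passing them to an LVLM or I2I GM.

```python
# Hypothetical 3x5 bitmap glyphs; only the letters needed for the demo word.
GLYPHS = {
    "C": ["111", "100", "100", "100", "111"],
    "A": ["010", "101", "111", "101", "101"],
    "T": ["111", "010", "010", "010", "010"],
}


def blank_image(width, height, value=0):
    """Return a height x width grayscale grid filled with `value`."""
    return [[value] * width for _ in range(height)]


def stamp_text(image, text, top, left, ink=255):
    """Return a copy of `image` with `text` drawn starting at (top, left).

    Each glyph is 3 pixels wide followed by a 1-pixel gap. Pixels that
    would fall outside the image are skipped instead of raising.
    """
    out = [row[:] for row in image]  # leave the clean input untouched
    for index, char in enumerate(text.upper()):
        glyph = GLYPHS[char]  # KeyError for letters outside the demo font
        x0 = left + index * 4
        for dy, bits in enumerate(glyph):
            for dx, bit in enumerate(bits):
                y, x = top + dy, x0 + dx
                if bit == "1" and 0 <= y < len(out) and 0 <= x < len(out[0]):
                    out[y][x] = ink
    return out


# Assumed demo: a blank 16x9 "image" and the injected word "cat".
clean = blank_image(16, 9)
injected = stamp_text(clean, "cat", top=2, left=2)

for row in injected:
    print("".join("#" if px else "." for px in row))
```

The function returns a modified copy so the clean and injected versions can be compared side by side, which mirrors the clean-versus-injected comparison the abstract describes only in general terms.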

Problem

Research questions and friction points this paper is trying to address.

Investigates typographic visual prompt injection threats

Analyzes impact on Vision-Language and Image-to-Image models

Introduces dataset to study performance modifications

Innovation

Methods, ideas, or system contributions that make the work stand out.

Investigates Typographic Visual Prompt Injection threats

Introduces TVPI Dataset for performance analysis

Explores security risks in Vision-Language Models

🔎 Similar Papers

Typography Leads Semantic Diversifying: Amplifying Adversarial Transferability across Multimodal Large Language Models