One Perturbation, Two Failure Modes: Probing VLM Safety via Embedding-Guided Typographic Perturbations

📅 2026-04-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates why typographic perturbations of text rendered in images cause safety failures in vision-language models (VLMs). Observing that certain renderings circumvent safety alignment, the authors propose multimodal embedding distance as an interpretable proxy metric and show that it correlates strongly with attack success rate (ASR). Building on this insight, they use the CWA-SSA algorithm to maximize image-text embedding similarity under ℓ∞-bounded perturbations across four surrogate embedding models, enabling black-box red-teaming without access to the target model. Experiments on GPT-4o, Claude Sonnet 4.5, Mistral-Large-3, and Qwen3-VL across five degradation settings show that the optimization raises attack success through two co-occurring effects: it restores text readability and it reduces safety refusals. Which effect dominates depends on the strength of the model's safety filter and the degree of visual degradation.
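
As a rough illustration of the proxy-metric idea in the summary above, the sketch below computes a Pearson correlation between embedding distance and ASR. The function name `pearson_r` and both data lists are hypothetical and made up for this example; they are not measurements from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# ASSUMPTION: invented values for illustration only, one pair per rendering.
embedding_distance = [0.20, 0.35, 0.50, 0.65, 0.80]
attack_success_rate = [0.90, 0.70, 0.55, 0.30, 0.10]

r = pearson_r(embedding_distance, attack_success_rate)
print(f"r = {r:.2f}")
```

A strongly negative `r` on real measurements would mean that renderings whose image embedding sits closer to the text embedding tend to succeed more often, which is the direction of the relationship the abstract reports.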

📝 Abstract

Typographic prompt injection exploits vision-language models' (VLMs) ability to read text rendered in images, posing a growing threat as VLMs power autonomous agents. Prior work typically focuses on maximizing attack success rate (ASR) but does not explain \emph{why} certain renderings bypass safety alignment. We make two contributions. First, an empirical study across four VLMs including GPT-4o and Claude, twelve font sizes, and ten transformations reveals that multimodal embedding distance strongly predicts ASR ($r{=}{-}0.71$ to ${-}0.93$, $p{<}0.01$), providing an interpretable, model-agnostic proxy. Since embedding distance predicts ASR, reducing it should improve attack success, but the relationship is mediated by two factors: perceptual readability (whether the VLM can parse the text) and safety alignment (whether it refuses to comply). Second, we use this relationship as a red-teaming tool: we directly maximize image-text embedding similarity under bounded $\ell_\infty$ perturbations via CWA-SSA across four surrogate embedding models, stress-testing both factors without access to the target model. Experiments across five degradation settings on GPT-4o, Claude Sonnet 4.5, Mistral-Large-3, and Qwen3-VL confirm that optimization recovers readability and reduces safety-aligned refusals as two co-occurring effects, with the dominant mechanism depending on the model's safety filter strength and the degree of visual degradation.
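
The optimization described in the abstract can be written compactly. The following is a hedged sketch, not the paper's exact objective: the symbols $x$ (rendered image), $t$ (target text), $\delta$ (perturbation), $\epsilon$ (perturbation budget), $f_k$ and $g_k$ (image and text encoders of the $k$-th surrogate), and the uniform average over $K = 4$ surrogates are all assumptions introduced here for illustration.

```latex
\[
\max_{\delta}\; \frac{1}{K} \sum_{k=1}^{K}
\cos\!\bigl(f_k(x + \delta),\, g_k(t)\bigr)
\quad \text{subject to} \quad \|\delta\|_\infty \le \epsilon,
\qquad
\cos(u, v) = \frac{u^\top v}{\|u\|_2 \, \|v\|_2}
\]
```

Under this reading, the $\ell_\infty$ constraint caps the change to any single pixel at $\epsilon$, and CWA-SSA would be the procedure used to search for $\delta$; how the surrogates are actually weighted or combined is not stated in the abstract.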

Problem

Research questions and friction points this paper is trying to address.

typographic prompt injection

vision-language models

safety alignment

multimodal embedding

adversarial perturbations

Innovation

Methods, ideas, or system contributions that make the work stand out.

embedding-guided perturbation

vision-language models

typographic prompt injection