Reading Between the Pixels: Linking Text-Image Embedding Alignment to Typographic Attack Success on Vision-Language Models

📅 2026-04-14

🤖 AI Summary

This study systematically evaluates how vulnerable vision-language models (VLMs) are to typographic prompt injection, in which adversarial text is rendered as an image to bypass safety mechanisms. Using prompts from SALAD-Bench, the authors measure attack success rate (ASR) on four models (GPT-4o, Claude Sonnet 4.5, Mistral-Large-3, and Qwen3-VL-4B-Instruct) across font sizes from 6 to 28 px and under visual perturbations such as rotation, blur, noise, and contrast changes. They also compute text-image embedding distance with two multimodal embedding models, JinaCLIP and Qwen3-VL-Embedding. The work reports a strong negative correlation (r = -0.71 to -0.93, p < 0.01) between embedding distance and ASR, finds that mid-range font sizes are the most exploitable while 6 px text yields near-zero ASR, and shows that text attacks succeed more often than image attacks on GPT-4o (36% vs 8%) and Claude (47% vs 22%). Heavy visual degradation reduces ASR by 34% to 96%. The authors conclude that model-specific robustness patterns preclude one-size-fits-all defenses.
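To make the notion of text-image embedding distance concrete, here is a minimal sketch that assumes the distance is a cosine distance between a text embedding and an image embedding. The summary does not state which metric the authors use, and the vectors below are toy placeholders rather than outputs of JinaCLIP or Qwen3-VL-Embedding.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length vectors.

    Assumption: cosine distance stands in for the paper's
    text-image embedding distance; the actual metric is not given.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


# Toy placeholder embeddings (hypothetical, not real model outputs).
text_emb = [1.0, 0.0, 0.0]            # embedding of the prompt as text
clean_image_emb = [1.0, 0.0, 0.0]     # rendered image, perfectly aligned
degraded_image_emb = [0.0, 1.0, 0.0]  # rendered image, fully misaligned

clean_distance = cosine_distance(text_emb, clean_image_emb)
degraded_distance = cosine_distance(text_emb, degraded_image_emb)
```

Under this reading, a rendered prompt that the embedding model "reads" well sits close to its text form (small distance), while a degraded rendering drifts away (larger distance).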

📝 Abstract

We study typographic prompt injection attacks on vision-language models (VLMs), where adversarial text is rendered as images to bypass safety mechanisms, posing a growing threat as VLMs serve as the perceptual backbone of autonomous agents, from browser automation and computer-use systems to camera-equipped embodied agents. In practice, the attack surface is heterogeneous: adversarial text appears at varying font sizes and under diverse visual conditions, while the growing ecosystem of VLMs exhibits substantial variation in vulnerability, complicating defensive approaches. Evaluating 1,000 prompts from SALAD-Bench across four VLMs, namely GPT-4o, Claude Sonnet 4.5, Mistral-Large-3, and Qwen3-VL-4B-Instruct, under varying font sizes (6–28 px) and visual transformations (rotation, blur, noise, contrast changes), we find: (1) font size significantly affects attack success rate (ASR), with very small fonts (6 px) yielding near-zero ASR while mid-range fonts achieve peak effectiveness; (2) text attacks are more effective than image attacks for GPT-4o (36% vs 8%) and Claude (47% vs 22%), while Qwen3-VL and Mistral show comparable ASR across modalities; (3) text-image embedding distance from two multimodal embedding models (JinaCLIP and Qwen3-VL-Embedding) shows strong negative correlation with ASR across all four models (r = -0.71 to -0.93, p < 0.01); (4) heavy degradations increase embedding distance by 10–12% and reduce ASR by 34–96%, while rotation asymmetrically affects models (Mistral drops 50%, GPT-4o unchanged). These findings highlight that model-specific robustness patterns preclude one-size-fits-all defenses and offer empirical guidance for practitioners selecting VLM backbones for agentic systems operating in adversarial environments.
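As a rough illustration of how the reported correlation could be computed, the sketch below derives ASR from per-prompt outcomes and then takes the Pearson correlation between embedding distance and ASR across conditions. The function names are hypothetical, the abstract does not describe the authors' pipeline, and the numbers are synthetic placeholders chosen to be perfectly linear, not results from the paper.

```python
import math


def attack_success_rate(outcomes):
    """Fraction of attack attempts judged successful (list of booleans)."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(outcomes) / len(outcomes)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Synthetic placeholders: one entry per condition (e.g. a font size or
# a degradation level). These are NOT measurements from the paper.
distances = [0.2, 0.3, 0.4, 0.5]  # mean text-image embedding distance
asrs = [0.8, 0.6, 0.4, 0.2]       # ASR under the same conditions

r = pearson_r(distances, asrs)
example_asr = attack_success_rate([True, False, True, True])
```

A negative `r` here corresponds to the pattern the abstract describes: conditions that push the rendered image further from its text embedding tend to have lower ASR.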

Problem

Research questions and friction points this paper is trying to address.

typographic attack

vision-language models

adversarial text

prompt injection

embedding alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

typographic attack

vision-language models

embedding alignment