Towards Mechanistic Defenses Against Typographic Attacks in CLIP

📅 2025-08-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the targeted misclassification, malicious content generation, and jailbreaking risks that typographic adversarial attacks pose to CLIP models. Through mechanistic analysis, it identifies the specific attention heads in the vision encoder that extract glyph features and propagate them to the cls token, revealing a dedicated "typographic circuit." Building on this insight, the authors propose a lightweight, fine-tuning-free defense: selectively ablating this circuit to yield a "dyslexic" CLIP variant. The method combines attention-head functional analysis with causal attribution to enable precise, head-level intervention. On a typographic variant of ImageNet-100, the defense improves accuracy by up to 19.6% while degrading standard ImageNet-100 accuracy by less than 1%, remaining competitive with state-of-the-art fine-tuning-based defenses. The work bridges mechanistic interpretability with actionable, architecture-aware interventions for vision-language model robustness.

📝 Abstract
Typographic attacks exploit multi-modal systems by injecting text into images, leading to targeted misclassifications, malicious content generation, and even Vision-Language Model jailbreaks. In this work, we analyze how CLIP vision encoders behave under typographic attacks, locating specialized attention heads in the latter half of the model's layers that causally extract and transmit typographic information to the cls token. Building on these insights, we introduce a method to defend CLIP models against typographic attacks by selectively ablating a typographic circuit consisting of these attention heads. Without requiring fine-tuning, our method improves performance by up to 19.6% on a typographic variant of ImageNet-100, while reducing standard ImageNet-100 accuracy by less than 1%. Notably, our training-free approach remains competitive with current state-of-the-art typographic defenses that rely on fine-tuning. We release a family of dyslexic CLIP models which are significantly more robust against typographic attacks. These models serve as suitable drop-in replacements for a broad range of safety-critical applications, where the risks of text-based manipulation outweigh the utility of text recognition.
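The core intervention, as the abstract describes it, is zeroing out the contribution of selected attention heads so that typographic information never reaches the cls token. The sketch below illustrates this on a toy ViT-style attention module, not on the authors' released models; the module, the head indices, and the `ablate_heads` parameter are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class MultiheadSelfAttention(nn.Module):
    """Toy ViT-style self-attention that can zero-ablate chosen heads."""

    def __init__(self, dim, num_heads, ablate_heads=()):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # Hypothetical "typographic" head indices to silence.
        self.ablate_heads = set(ablate_heads)

    def forward(self, x):
        B, N, D = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                # each: (B, H, N, d)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn.softmax(dim=-1)
        out = attn @ v                                       # (B, H, N, d)
        # Ablation: zero each selected head's output before the output
        # projection mixes heads, so its contribution to every token
        # (including the cls token) is removed.
        for h in self.ablate_heads:
            out[:, h] = 0.0
        out = out.transpose(1, 2).reshape(B, N, D)
        return self.proj(out)
```

On a real pretrained CLIP, the same effect can be achieved with forward hooks on the identified layers, leaving the weights untouched, which is what makes the defense training-free.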
Problem

Research questions and friction points this paper is trying to address.

Defending CLIP models against typographic attacks
Identifying attention heads transmitting typographic information in CLIP
Developing training-free defense against text-injection image attacks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Selectively ablating typographic circuit attention heads
Training-free defense method against typographic attacks
Dyslexic CLIP models as drop-in replacements