Robustness in Both Domains: CLIP Needs a Robust Text Encoder

📅 2025-06-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
CLIP’s text encoder is highly vulnerable to adversarial attacks, which undermines the reliability of downstream multimodal applications such as text-to-image generation and cross-modal retrieval. This work presents the first systematic study of the adversarial robustness of CLIP’s text encoder and proposes LEAF, a lightweight adversarial fine-tuning framework that scales to large CLIP models. LEAF performs adversarial training in the text domain, improving zero-shot adversarial accuracy on text inputs without degrading visual-encoder performance. It also improves text-to-image generation quality and cross-modal retrieval recall under adversarial noise, and enables more faithful reconstruction of input text from its embedding. Extensive experiments across large CLIP variants (e.g., ViT-L/14, RN50x16) demonstrate consistent robustness gains under diverse attack settings, establishing a new direction for adversarial robustness in multimodal foundation models.
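The summary centers on adversarial attacks in the text domain. As a rough illustration (not the paper's actual method), a character-level attack on a text encoder can be sketched as a greedy search for substitutions that maximize the shift of the embedding; here a hashed character-trigram embedder stands in for CLIP's text encoder, which is an assumption made purely to keep the example self-contained:

```python
import hashlib
import math
import string

def embed(text, dim=64):
    """Toy text embedder: hashed bag of character trigrams.
    A stand-in for CLIP's text encoder (assumption for illustration)."""
    v = [0.0] * dim
    padded = f"  {text}  "
    for i in range(len(padded) - 2):
        h = int(hashlib.md5(padded[i:i + 3].encode()).hexdigest(), 16)
        v[h % dim] += 1.0 if (h >> 8) % 2 else -1.0
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cos_dist(a, b):
    """Cosine distance between two unit-normalized vectors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))

def greedy_char_attack(text, budget=2):
    """Greedily substitute up to `budget` characters, each time keeping
    the single-character change that most increases the distance from
    the clean embedding."""
    clean = embed(text)
    adv = text
    for _ in range(budget):
        best, best_d = adv, cos_dist(embed(adv), clean)
        for i in range(len(adv)):
            for c in string.ascii_lowercase:
                cand = adv[:i] + c + adv[i + 1:]
                d = cos_dist(embed(cand), clean)
                if d > best_d:
                    best, best_d = cand, d
        adv = best
    return adv

adv = greedy_char_attack("a photo of a dog")
print(adv)
```

Adversarial fine-tuning in this setting would then minimize the embedding shift such perturbations cause; the toy embedder above has no trainable parameters, so that step is omitted.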

📝 Abstract
Adversarial input attacks can cause a significant shift of CLIP embeddings. This can affect the downstream robustness of models incorporating CLIP in the pipeline, such as text-to-image generative models or large vision language models. While some efforts have been made toward making the CLIP image encoders robust, the robustness of text encoders remains unexplored. In this work, we close this gap in the literature. We propose LEAF: an efficient adversarial finetuning method for the text domain, with the ability to scale to large CLIP models. Our models significantly improve the zero-shot adversarial accuracy in the text domain, while maintaining the vision performance provided by robust image encoders. When combined with text-to-image diffusion models, we can improve the generation quality under adversarial noise. When employing our robust CLIP encoders in multimodal retrieval tasks, we improve the recall under adversarial noise over standard CLIP models. Finally, we show that robust text encoders facilitate better reconstruction of input text from its embedding via direct optimization.
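The abstract's last claim, reconstructing input text from its embedding via direct optimization, can be sketched as discrete coordinate descent: revisit each word slot and keep the vocabulary entry that most reduces the distance to the target embedding. This is a minimal sketch under stated assumptions; the toy hashed-trigram embedder and the tiny vocabulary are illustrative stand-ins, while the paper optimizes against the actual CLIP text encoder:

```python
import hashlib
import math

def embed(text, dim=64):
    """Toy text embedder: hashed bag of character trigrams.
    A stand-in for CLIP's text encoder (assumption for illustration)."""
    v = [0.0] * dim
    padded = f"  {text}  "
    for i in range(len(padded) - 2):
        h = int(hashlib.md5(padded[i:i + 3].encode()).hexdigest(), 16)
        v[h % dim] += 1.0 if (h >> 8) % 2 else -1.0
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cos_dist(a, b):
    """Cosine distance between two unit-normalized vectors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))

def invert_embedding(target, length, vocab, sweeps=20):
    """Reconstruct a caption whose embedding is close to `target` by
    coordinate descent over word slots, accepting only improvements."""
    words = [vocab[0]] * length
    for _ in range(sweeps):
        improved = False
        for pos in range(length):
            best_w = words[pos]
            best_d = cos_dist(embed(" ".join(words)), target)
            for w in vocab:
                words[pos] = w
                d = cos_dist(embed(" ".join(words)), target)
                if d < best_d:
                    best_w, best_d, improved = w, d, True
            words[pos] = best_w
        if not improved:
            break
    return " ".join(words)

vocab = ["a", "photo", "of", "dog", "cat", "the"]
target = embed("a photo of a dog")
recovered = invert_embedding(target, 5, vocab)
print(recovered)
```

The paper's point is that robust text encoders make this inversion work better; with a gradient-based encoder one would typically optimize continuous token embeddings directly rather than search discretely.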
Problem

Research questions and friction points this paper is trying to address.

Adversarial input attacks cause large shifts in CLIP text embeddings
No prior work addresses the adversarial robustness of CLIP text encoders
Adversarial accuracy must improve without compromising vision performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes LEAF for adversarial text encoder finetuning
Improves zero-shot adversarial accuracy in text
Enhances generation quality under adversarial noise