🤖 AI Summary
CLIP's reliance on web-scraped image-text pairs renders it vulnerable to data poisoning and backdoor attacks; existing defenses employ only global representations for cross-modal matching, neglecting fine-grained vision-language semantics, thereby suffering from misalignment and performance degradation. To address this, we propose an Optimal Transport (OT)-guided framework for fine-grained image-text matching and alignment. Our method first constructs a cross-modal distance metric between pixel-level visual patches and token-level textual units. It then introduces OT-driven intra-modal consistency and inter-modal alignment objectives within contrastive learning, explicitly penalizing mismatched image-text pairs. The resulting model preserves pretraining efficacy while substantially reducing poisoning attack success rates. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art defenses across zero-shot classification and linear probe benchmarks, establishing new performance frontiers in robust vision-language representation learning.
📄 Abstract
Recent studies have shown that Contrastive Language-Image Pre-training (CLIP) models are vulnerable to targeted data poisoning and backdoor attacks because their training image-caption pairs are crawled from the Internet at massive scale. Previous defense methods correct poisoned image-caption pairs by matching a new caption to each image. However, the matching process relies solely on the global representations of images and captions, overlooking fine-grained visual and textual features; it may therefore introduce incorrect image-caption pairs and harm CLIP pre-training. To address these limitations, we propose an Optimal Transport-based framework, named OTCCLIP, to reconstruct image-caption pairs. We introduce a new optimal transport-based distance measure between fine-grained visual and textual feature sets and re-assign new captions based on this distance. Additionally, to further reduce the negative impact of mismatched pairs, we encourage inter- and intra-modality fine-grained alignment through optimal transport-based objective functions. Our experiments demonstrate that OTCCLIP successfully decreases the attack success rates of poisoning attacks and, compared to previous methods, significantly improves the zero-shot and linear-probing performance of CLIP trained on poisoned datasets.
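To make the core idea concrete, here is a minimal sketch of an entropic optimal-transport (Sinkhorn) distance between a set of image-patch embeddings and a set of caption-token embeddings. This is an illustrative reconstruction, not the paper's implementation: the function name, the cosine cost, the uniform marginals, and all hyperparameters (`eps`, `n_iters`) are assumptions.

```python
import numpy as np

def sinkhorn_ot_distance(patches, tokens, eps=0.1, n_iters=100):
    """Entropic OT distance between two sets of feature vectors.

    patches: (m, d) image-patch embeddings; tokens: (n, d) caption-token
    embeddings. All values here (cost choice, eps, iteration count) are
    illustrative assumptions, not the paper's settings.
    """
    # Cost matrix: cosine distance between every patch/token pair.
    patches = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    tokens = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    C = 1.0 - patches @ tokens.T                      # (m, n)
    m, n = C.shape
    a = np.full(m, 1.0 / m)                           # uniform marginal over patches
    b = np.full(n, 1.0 / n)                           # uniform marginal over tokens
    K = np.exp(-C / eps)                              # Gibbs kernel
    u = np.ones(m)
    for _ in range(n_iters):                          # Sinkhorn scaling iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]                   # approximate transport plan
    return float(np.sum(P * C))                       # transport cost <P, C>
```

Caption re-assignment would then pick, for each image, the candidate caption with the smallest OT distance; the same distance can serve as a soft penalty on mismatched pairs inside a contrastive objective.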