🤖 AI Summary
CLIP's reliance on web-scraped image-text pairs renders it vulnerable to data poisoning and backdoor attacks; existing defenses employ only global representations for cross-modal matching, neglecting fine-grained vision-language semantics, thereby suffering from misalignment and performance degradation. To address this, we propose an Optimal Transport (OT)-guided framework for fine-grained image-text matching and alignment. Our method first constructs a cross-modal distance metric between pixel-level visual patches and token-level textual units. It then introduces OT-driven intra-modal consistency and inter-modal alignment objectives within contrastive learning, explicitly penalizing mismatched image-text pairs. The resulting model preserves pretraining efficacy while substantially reducing poisoning attack success rates. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art defenses across zero-shot classification and linear probe benchmarks, establishing new performance frontiers in robust vision-language representation learning.
📄 Abstract
Recent studies have shown that Contrastive Language-Image Pre-training (CLIP) models are vulnerable to targeted data poisoning and backdoor attacks because their training image-caption pairs are crawled from the Internet at massive scale. Previous defense methods correct poisoned image-caption pairs by matching a new caption to each image. However, the matching process relies solely on the global representations of images and captions, overlooking fine-grained visual and textual features; it may therefore introduce incorrect image-caption pairs and harm CLIP pre-training. To address these limitations, we propose an Optimal Transport-based framework, named OTCCLIP, to reconstruct image-caption pairs. We introduce a new optimal transport-based distance measure between fine-grained visual and textual feature sets and re-assign new captions based on this distance. Additionally, to further reduce the negative impact of mismatched pairs, we encourage inter- and intra-modality fine-grained alignment through optimal transport-based objective functions. Our experiments demonstrate that OTCCLIP successfully decreases the attack success rates of poisoning attacks and, compared to previous methods, significantly improves the zero-shot and linear-probing performance of CLIP trained on poisoned datasets.
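To make the core idea concrete, here is a minimal sketch of an entropic optimal-transport (Sinkhorn) distance between a set of image-patch embeddings and a set of caption-token embeddings. This is an illustrative reconstruction, not the paper's implementation: the function name, the cosine cost, the uniform marginals, and all hyperparameters (`eps`, `n_iters`) are assumptions.

```python
import numpy as np

def sinkhorn_ot_distance(patches, tokens, eps=0.1, n_iters=100):
    """Entropic OT distance between two sets of feature vectors.

    patches: (m, d) image-patch embeddings; tokens: (n, d) caption-token
    embeddings. All values here (cost choice, eps, iteration count) are
    illustrative assumptions, not the paper's settings.
    """
    # Cost matrix: cosine distance between every patch/token pair.
    patches = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    tokens = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    C = 1.0 - patches @ tokens.T                      # (m, n)
    m, n = C.shape
    a = np.full(m, 1.0 / m)                           # uniform marginal over patches
    b = np.full(n, 1.0 / n)                           # uniform marginal over tokens
    K = np.exp(-C / eps)                              # Gibbs kernel
    u = np.ones(m)
    for _ in range(n_iters):                          # Sinkhorn scaling iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]                   # approximate transport plan
    return float(np.sum(P * C))                       # transport cost <P, C>
```

Caption re-assignment would then pick, for each image, the candidate caption with the smallest OT distance; the same distance can serve as a soft penalty on mismatched pairs inside a contrastive objective.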