🤖 AI Summary
Zero-shot CLIP models exhibit significant performance degradation under common image corruptions (e.g., noise, blur), and existing unimodal test-time adaptation (TTA) methods fail to effectively mitigate this drop. To address this, we propose the first multimodal online TTA framework that jointly adapts both the visual and textual encoders during inference. Our method constructs class prototypes via pseudo-labeling to strengthen vision-language semantic alignment and introduces a contrastive alignment loss to enable end-to-end co-adaptation. The core innovation is bringing multimodal collaborative optimization into the online TTA paradigm, overcoming the limitations of unimodal adaptation. Evaluated on corruption benchmarks (e.g., ImageNet-C), our approach achieves state-of-the-art online performance. Moreover, it demonstrates strong generalization across multiple domain generalization datasets, confirming its robustness beyond synthetic corruptions.
📝 Abstract
Although open-vocabulary classification models like Contrastive Language-Image Pretraining (CLIP) have demonstrated strong zero-shot learning capabilities, their robustness to common image corruptions remains poorly understood. Through extensive experiments, we show that zero-shot CLIP lacks robustness to common image corruptions at test time, necessitating the adaptation of CLIP to unlabeled corrupted images using test-time adaptation (TTA). However, we find that existing TTA methods have severe limitations in adapting CLIP due to their unimodal nature. To address these limitations, we propose $\texttt{BATCLIP}$, a bimodal $\textbf{online}$ TTA method designed to improve CLIP's robustness to common image corruptions. The key insight of our approach is not only to adapt the visual encoder to improve image features but also to strengthen the alignment between image and text features by promoting a stronger association between the image class prototype, computed using pseudo-labels, and the corresponding text feature. We evaluate our approach on benchmark image corruption datasets and achieve state-of-the-art results in online TTA for CLIP. Furthermore, we evaluate our proposed TTA approach on various domain generalization datasets to demonstrate its generalization capabilities.
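To make the prototype-alignment idea concrete, here is a minimal NumPy sketch of the two ingredients the abstract describes: pseudo-labeling test images against the text features, averaging image features per pseudo-class into prototypes, and an InfoNCE-style contrastive loss pulling each prototype toward its class text feature. All function names, the specific softmax/cross-entropy loss form, and the temperature value are illustrative assumptions, not the exact BATCLIP formulation; features are assumed L2-normalized, as in CLIP.

```python
import numpy as np

def pseudo_label(image_feats, text_feats):
    # Assign each image the class of its nearest text feature
    # (cosine similarity; both sets assumed L2-normalized).
    return (image_feats @ text_feats.T).argmax(axis=1)

def class_prototypes(image_feats, labels, num_classes):
    # Mean image feature per pseudo-class, re-normalized to the unit sphere.
    # Classes with no pseudo-labeled samples keep a zero prototype here
    # (illustrative simplification).
    protos = np.zeros((num_classes, image_feats.shape[1]))
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            p = image_feats[mask].mean(axis=0)
            protos[c] = p / np.linalg.norm(p)
    return protos

def alignment_loss(protos, text_feats, temperature=0.07):
    # Contrastive (InfoNCE-style) loss: for each class, the matching
    # prototype/text pair is the positive, all other texts are negatives.
    logits = protos @ text_feats.T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

In the actual method this loss would be backpropagated through both encoders on each test batch, which is what distinguishes the bimodal setup from unimodal TTA baselines that only touch the visual side.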