🤖 AI Summary
Zero-shot CLIP models exhibit significant performance degradation under common image corruptions (e.g., noise, blur), and existing unimodal test-time adaptation (TTA) methods fail to effectively mitigate this drop. To address this, we propose the first multimodal online TTA framework that jointly adapts both the visual and textual encoders during inference. Our method constructs class prototypes via pseudo-labeling to strengthen vision-language semantic alignment and introduces a contrastive alignment loss to enable end-to-end co-adaptation. The core innovation is bringing multimodal collaborative optimization into the online TTA paradigm, overcoming the limitations of unimodal adaptation. Evaluated on corruption benchmarks (e.g., ImageNet-C), our approach achieves state-of-the-art online performance. Moreover, it demonstrates strong generalization across multiple domain generalization datasets, confirming its robustness beyond synthetic corruptions.
📝 Abstract
Although open-vocabulary classification models like Contrastive Language-Image Pretraining (CLIP) have demonstrated strong zero-shot learning capabilities, their robustness to common image corruptions remains poorly understood. Through extensive experiments, we show that zero-shot CLIP lacks robustness to common image corruptions at test time, necessitating the adaptation of CLIP to unlabeled corrupted images using test-time adaptation (TTA). However, we find that existing TTA methods have severe limitations in adapting CLIP due to their unimodal nature. To address these limitations, we propose $\texttt{BATCLIP}$, a bimodal $\textbf{online}$ TTA method designed to improve CLIP's robustness to common image corruptions. The key insight of our approach is not only to adapt the visual encoder to improve image features but also to strengthen the alignment between image and text features by promoting a stronger association between the image class prototype, computed using pseudo-labels, and the corresponding text feature. We evaluate our approach on benchmark image corruption datasets and achieve state-of-the-art results in online TTA for CLIP. Furthermore, we evaluate our proposed TTA approach on various domain generalization datasets to demonstrate its generalization capabilities.
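To make the prototype-alignment idea concrete, here is a minimal NumPy sketch of the two ingredients the abstract describes: pseudo-labeling test images against the text features, averaging image features per pseudo-class into prototypes, and an InfoNCE-style contrastive loss pulling each prototype toward its class text feature. All function names, the specific softmax/cross-entropy loss form, and the temperature value are illustrative assumptions, not the exact BATCLIP formulation; features are assumed L2-normalized, as in CLIP.

```python
import numpy as np

def pseudo_label(image_feats, text_feats):
    # Assign each image the class of its nearest text feature
    # (cosine similarity; both sets assumed L2-normalized).
    return (image_feats @ text_feats.T).argmax(axis=1)

def class_prototypes(image_feats, labels, num_classes):
    # Mean image feature per pseudo-class, re-normalized to the unit sphere.
    # Classes with no pseudo-labeled samples keep a zero prototype here
    # (illustrative simplification).
    protos = np.zeros((num_classes, image_feats.shape[1]))
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            p = image_feats[mask].mean(axis=0)
            protos[c] = p / np.linalg.norm(p)
    return protos

def alignment_loss(protos, text_feats, temperature=0.07):
    # Contrastive (InfoNCE-style) loss: for each class, the matching
    # prototype/text pair is the positive, all other texts are negatives.
    logits = protos @ text_feats.T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

In the actual method this loss would be backpropagated through both encoders on each test batch, which is what distinguishes the bimodal setup from unimodal TTA baselines that only touch the visual side.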