🤖 AI Summary
Vision-language models (VLMs) such as CLIP exhibit strong zero-shot capabilities but generalize poorly under distribution shifts. Existing test-time adaptation (TTA) methods predominantly rely on entropy minimization, which is misaligned with CLIP's contrastive image-text pre-training objective and often leads to pseudo-label drift and class collapse. This work proposes CLIPTTA, the first TTA framework built on a soft contrastive loss that directly matches CLIP's pre-training objective; a theoretical analysis of its gradients shows how their batch-aware structure stabilizes optimization and mitigates the risk of class collapse. The framework is further extended to the open-set setting with Outlier Contrastive Exposure (OCE), a loss that improves out-of-distribution (OOD) detection. Across 75 datasets spanning diverse distribution shifts, CLIPTTA consistently outperforms entropy-based objectives and is highly competitive with state-of-the-art TTA methods, with more stable performance across shifts.
📝 Abstract
Vision-language models (VLMs) like CLIP exhibit strong zero-shot capabilities but often fail to generalize under distribution shifts. Test-time adaptation (TTA) allows models to update at inference time without labeled data, typically via entropy minimization. However, this objective is fundamentally misaligned with the contrastive image-text training of VLMs, limiting adaptation performance and introducing failure modes such as pseudo-label drift and class collapse. We propose CLIPTTA, a new gradient-based TTA method for vision-language models that leverages a soft contrastive loss aligned with CLIP's pre-training objective. We provide a theoretical analysis of CLIPTTA's gradients, showing how its batch-aware design mitigates the risk of collapse. We further extend CLIPTTA to the open-set setting, where both in-distribution (ID) and out-of-distribution (OOD) samples are encountered, using an Outlier Contrastive Exposure (OCE) loss to improve OOD detection. Evaluated on 75 datasets spanning diverse distribution shifts, CLIPTTA consistently outperforms entropy-based objectives and is highly competitive with state-of-the-art TTA methods, outperforming them on a large number of datasets and exhibiting more stable performance across diverse shifts.
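To make the "soft contrastive loss aligned with CLIP's pre-training objective" concrete, below is a minimal PyTorch-style sketch of one plausible instantiation: a bidirectional (image-to-text and text-to-image) cross-entropy over a test batch, with soft targets taken from the model's own zero-shot predictions. The function name, the soft-target construction, and the symmetric normalization are illustrative assumptions, not the authors' released implementation; the text-to-image term is included only to illustrate the batch-aware gradient coupling the abstract refers to, where each sample's update depends on the rest of the batch.

```python
# Hypothetical sketch of a soft, batch-wise contrastive TTA objective for CLIP.
# Names and the soft-target construction are assumptions for illustration only.
import torch


def soft_contrastive_tta_loss(image_feats: torch.Tensor,  # (B, D) L2-normalized test-batch image embeddings
                              text_feats: torch.Tensor,   # (C, D) L2-normalized class-prompt text embeddings
                              temperature: float = 0.01) -> torch.Tensor:
    # Image-text similarity logits, as in CLIP's contrastive pre-training.
    logits = image_feats @ text_feats.t() / temperature            # (B, C)

    # Soft pseudo-labels from the model's own predictions (assumption).
    with torch.no_grad():
        targets = logits.softmax(dim=-1)                           # (B, C)

    # Image-to-text direction: cross-entropy normalized over classes.
    i2t = -(targets * logits.log_softmax(dim=-1)).sum(dim=-1).mean()

    # Text-to-image direction: cross-entropy normalized over the batch,
    # so gradients couple samples and discourage collapsing all images
    # onto a single class (the batch-aware effect described above).
    targets_t2i = targets / targets.sum(dim=0, keepdim=True).clamp_min(1e-8)
    t2i = -(targets_t2i * logits.log_softmax(dim=0)).sum(dim=0).mean()

    return 0.5 * (i2t + t2i)


# Toy usage with random features standing in for encoder outputs;
# in practice the embeddings come from CLIP's encoders and only
# lightweight parameters would be updated at test time.
B, C, D = 64, 10, 512
img = torch.nn.functional.normalize(torch.randn(B, D, requires_grad=True), dim=-1)
txt = torch.nn.functional.normalize(torch.randn(C, D), dim=-1)
soft_contrastive_tta_loss(img, txt).backward()
```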