🤖 AI Summary
Vision-language models (VLMs) such as CLIP exhibit strong zero-shot capabilities but generalize poorly under distribution shifts. Existing test-time adaptation (TTA) methods predominantly rely on entropy minimization, which is misaligned with CLIP's contrastive image-text pre-training objective and often leads to pseudo-label drift and class collapse. This work proposes CLIPTTA, the first TTA framework built on a soft contrastive loss that directly matches CLIP's pre-training objective; a theoretical analysis of its gradients shows how their batch-aware structure stabilizes optimization and mitigates the risk of class collapse. The framework is further extended to the open-set setting with Outlier Contrastive Exposure (OCE), a loss that improves out-of-distribution (OOD) detection. Across 75 datasets spanning diverse distribution shifts, CLIPTTA consistently outperforms entropy-based objectives and is highly competitive with state-of-the-art TTA methods, with more stable performance across shifts.
📝 Abstract
Vision-language models (VLMs) like CLIP exhibit strong zero-shot capabilities but often fail to generalize under distribution shifts. Test-time adaptation (TTA) allows models to update at inference time without labeled data, typically via entropy minimization. However, this objective is fundamentally misaligned with the contrastive image-text training of VLMs, limiting adaptation performance and introducing failure modes such as pseudo-label drift and class collapse. We propose CLIPTTA, a new gradient-based TTA method for vision-language models that leverages a soft contrastive loss aligned with CLIP's pre-training objective. We provide a theoretical analysis of CLIPTTA's gradients, showing how its batch-aware design mitigates the risk of collapse. We further extend CLIPTTA to the open-set setting, where both in-distribution (ID) and out-of-distribution (OOD) samples are encountered, using an Outlier Contrastive Exposure (OCE) loss to improve OOD detection. Evaluated on 75 datasets spanning diverse distribution shifts, CLIPTTA consistently outperforms entropy-based objectives and is highly competitive with state-of-the-art TTA methods, outperforming them on a large number of datasets and exhibiting more stable performance across diverse shifts.
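To make the "soft contrastive loss aligned with CLIP's pre-training objective" concrete, below is a minimal PyTorch-style sketch of one plausible instantiation: a bidirectional (image-to-text and text-to-image) cross-entropy over a test batch, with soft targets taken from the model's own zero-shot predictions. The function name, the soft-target construction, and the symmetric normalization are illustrative assumptions, not the authors' released implementation; the text-to-image term is included only to illustrate the batch-aware gradient coupling the abstract refers to, where each sample's update depends on the rest of the batch.

```python
# Hypothetical sketch of a soft, batch-wise contrastive TTA objective for CLIP.
# Names and the soft-target construction are assumptions for illustration only.
import torch


def soft_contrastive_tta_loss(image_feats: torch.Tensor,  # (B, D) L2-normalized test-batch image embeddings
                              text_feats: torch.Tensor,   # (C, D) L2-normalized class-prompt text embeddings
                              temperature: float = 0.01) -> torch.Tensor:
    # Image-text similarity logits, as in CLIP's contrastive pre-training.
    logits = image_feats @ text_feats.t() / temperature            # (B, C)

    # Soft pseudo-labels from the model's own predictions (assumption).
    with torch.no_grad():
        targets = logits.softmax(dim=-1)                           # (B, C)

    # Image-to-text direction: cross-entropy normalized over classes.
    i2t = -(targets * logits.log_softmax(dim=-1)).sum(dim=-1).mean()

    # Text-to-image direction: cross-entropy normalized over the batch,
    # so gradients couple samples and discourage collapsing all images
    # onto a single class (the batch-aware effect described above).
    targets_t2i = targets / targets.sum(dim=0, keepdim=True).clamp_min(1e-8)
    t2i = -(targets_t2i * logits.log_softmax(dim=0)).sum(dim=0).mean()

    return 0.5 * (i2t + t2i)


# Toy usage with random features standing in for encoder outputs;
# in practice the embeddings come from CLIP's encoders and only
# lightweight parameters would be updated at test time.
B, C, D = 64, 10, 512
img = torch.nn.functional.normalize(torch.randn(B, D, requires_grad=True), dim=-1)
txt = torch.nn.functional.normalize(torch.randn(C, D), dim=-1)
soft_contrastive_tta_loss(img, txt).backward()
```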