Mint: A Simple Test-Time Adaptation of Vision-Language Models against Common Corruptions

📅 2025-10-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Pretrained vision-language models (e.g., CLIP) suffer severe robustness degradation under image corruptions. We identify "embedding variance collapse" as the root cause: both intra-class and inter-class embedding variances shrink as corruption severity increases, and inter-class variance correlates strongly with classification accuracy. To address this, we propose a lightweight test-time adaptation method that maximizes inter-class variance online, using only pseudo-labels, a mean accumulator, and a gradient accumulator, without additional parameters or training. The approach is architecture-agnostic and applicable to any CLIP variant. Evaluated on standard corruption benchmarks (ImageNet-C, CIFAR-10-C), it delivers consistent improvements across diverse corruption types and remains effective even with small batch sizes, substantially enhancing zero-shot robustness under distribution shift.

📝 Abstract
Pretrained vision-language models such as CLIP achieve strong zero-shot generalization but remain vulnerable to distribution shifts caused by input corruptions. In this work, we investigate how corruptions affect CLIP's image embeddings and uncover a consistent phenomenon we term embedding variance collapse, where both intra-class and inter-class variances shrink as corruption severity increases. We find that this collapse is closely tied to performance degradation, with inter-class variance strongly correlated with classification accuracy. To explain this phenomenon, we analyze how corruptions alter the structure of the embedding space. Our theoretical results suggest that the visual encoder tends to encode corruption-related signals, which dilute class-discriminative features and compress the representation geometry. We further show that maximizing inter-class variance, even when estimated from pseudo-labels, can provably enhance embedding quality. Based on this insight, we propose Mint, a simple test-time adaptation method that maximizes pseudo-label-based inter-class variance on the fly using a mean accumulator and a gradient accumulator. Mint operates effectively with small batch sizes and consistently improves performance across multiple corruption benchmarks and CLIP architectures. Our code is available at https://github.com/baowenxuan/Mint.
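As a rough illustration of the objective described above, pseudo-label-based inter-class variance can be estimated as the size-weighted spread of per-class mean embeddings around the global mean. This is a minimal sketch under assumed definitions; the helper name `inter_class_variance` and its exact estimator are illustrative, not the paper's implementation:

```python
import numpy as np

def inter_class_variance(embeddings, pseudo_labels):
    """Size-weighted spread of per-class mean embeddings around the
    global mean embedding (hypothetical estimator for illustration).

    embeddings:    (N, D) array of image embeddings
    pseudo_labels: (N,) array of predicted class indices
    """
    global_mean = embeddings.mean(axis=0)
    total = 0.0
    for c in np.unique(pseudo_labels):
        mask = pseudo_labels == c
        class_mean = embeddings[mask].mean(axis=0)
        # Weight each class by its pseudo-label count.
        total += mask.sum() * np.sum((class_mean - global_mean) ** 2)
    return total / len(embeddings)

# Well-separated class means yield high inter-class variance;
# collapsed embeddings (all near one point) yield ~0.
separated = np.array([[0., 0.], [0., 0.], [10., 0.], [10., 0.]])
collapsed = np.full((4, 2), 5.0)
labels = np.array([0, 0, 1, 1])
print(inter_class_variance(separated, labels))  # 25.0
print(inter_class_variance(collapsed, labels))  # 0.0
```

Maximizing this quantity during test-time adaptation would push per-class mean embeddings apart, counteracting the variance collapse the paper identifies.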
Problem

Research questions and friction points this paper is trying to address.

Addresses CLIP's vulnerability to image corruptions, which induce embedding variance collapse
Analyzes how corruptions dilute class-discriminative features in vision-language models
Proposes test-time adaptation that maximizes pseudo-label-based inter-class variance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Maximizes pseudo-label-based inter-class variance
Uses mean and gradient accumulators for adaptation
Operates effectively with small batch sizes
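The mean accumulator mentioned above plausibly maintains streaming per-class statistics so that inter-class variance can be estimated even when each batch is tiny. A hypothetical sketch (class name and interface are assumptions, not the paper's code):

```python
import numpy as np

class MeanAccumulator:
    """Running per-class mean of embeddings over streaming test batches.

    Hypothetical sketch: accumulates per-class sums and counts so class
    means stay well-estimated even with very small batch sizes.
    """

    def __init__(self, num_classes, dim):
        self.sums = np.zeros((num_classes, dim))
        self.counts = np.zeros(num_classes)

    def update(self, embeddings, pseudo_labels):
        # Fold a new batch into the running per-class sums and counts.
        for z, c in zip(embeddings, pseudo_labels):
            self.sums[c] += z
            self.counts[c] += 1

    def means(self):
        # Return per-class means; classes never seen stay at zero.
        out = np.zeros_like(self.sums)
        valid = self.counts > 0
        out[valid] = self.sums[valid] / self.counts[valid][:, None]
        return out

# Two batches of size 2 and 1 give the same class means as one big batch.
acc = MeanAccumulator(num_classes=2, dim=2)
acc.update(np.array([[1., 1.], [3., 3.]]), np.array([0, 0]))
acc.update(np.array([[5., 5.]]), np.array([1]))
print(acc.means())  # class 0 mean [2. 2.], class 1 mean [5. 5.]
```

A gradient accumulator could play the analogous role on the optimization side, summing gradients across small batches before applying an update; the exact mechanism is described in the paper.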