🤖 AI Summary
Vision-language models (e.g., CLIP) suffer substantial degradation in zero-shot classification accuracy under sensor-induced distortions—such as adverse weather, low illumination, and noise—due to severe distribution shifts that existing test-time adaptation (TTA) methods fail to mitigate.
Method: We propose UnInfo, a uniformity-aware information-balancing TTA framework. UnInfo is the first to explicitly optimize for embedding-space uniformity, jointly implementing uniformity-driven confidence maximization, information-aware loss reweighting, and EMA-teacher-guided knowledge distillation to suppress distortion-induced information collapse.
Contribution/Results: UnInfo operates entirely without labels or source-domain data, yet achieves significant improvements in CLIP’s zero-shot classification accuracy across diverse sensor degradation scenarios. Crucially, it preserves both discriminability and uniformity in the learned embedding space. By unifying uniformity regularization with information-aware adaptation, UnInfo establishes a novel paradigm for robust vision-language understanding under real-world sensing impairments.
📝 Abstract
Pre-trained vision-language models such as contrastive language-image pre-training (CLIP) have demonstrated remarkable generalizability, enabling a wide range of applications, most notably zero-shot classification. However, vision-language models still degrade on datasets that differ substantially from their training data, i.e., under distribution shift. We find that CLIP is especially vulnerable to sensor degradation, a realistic type of distribution shift caused by sensor conditions such as weather, light, or noise. Collecting a new dataset from the test distribution for fine-tuning is costly because sensor degradation occurs unexpectedly and takes many forms. Thus, we investigate test-time adaptation (TTA) of zero-shot classification, which enables on-the-fly adaptation to the test distribution using unlabeled test data. Existing TTA methods for CLIP mainly focus on modifying image and text embeddings or predictions to address distribution shifts. Although these methods can adapt to domain shifts such as fine-grained label spaces or different renditions of input images, they fail to adapt to distribution shifts caused by sensor degradation. We find that this is because image embeddings are "corrupted" in terms of uniformity, a measure related to the amount of information. To make models robust to sensor degradation, we propose a novel method called uniformity-aware information-balanced TTA (UnInfo). To address the corruption of image embeddings, we introduce uniformity-aware confidence maximization, information-aware loss balancing, and knowledge distillation from an exponential moving average (EMA) teacher. Through experiments, we demonstrate that UnInfo improves accuracy under sensor degradation by retaining information in terms of uniformity.
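To make the two recurring terms in the abstract concrete, here is a minimal sketch of how embedding uniformity and an EMA teacher update are commonly computed. The abstract does not state the exact formulas UnInfo uses, so everything here is an assumption: the uniformity function follows the widely used definition from Wang and Isola (log of the mean pairwise Gaussian potential over L2-normalized embeddings), and the function names `uniformity` and `ema_update`, the temperature `t`, and the `momentum` value are hypothetical choices for illustration only.

```python
import numpy as np


def uniformity(embeddings, t=2.0):
    """Assumed (Wang-Isola style) uniformity of a batch of embeddings.

    Returns log(mean(exp(-t * ||z_i - z_j||^2))) over all pairs i < j
    of L2-normalized rows. Lower values mean the embeddings are spread
    more evenly over the hypersphere; 0 means they have fully collapsed.
    """
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    # For unit vectors, ||a - b||^2 = 2 - 2 * (a . b).
    sq_dists = np.clip(2.0 - 2.0 * (z @ z.T), 0.0, None)
    upper = np.triu_indices(len(z), k=1)  # each unordered pair once
    return float(np.log(np.mean(np.exp(-t * sq_dists[upper]))))


def ema_update(teacher, student, momentum=0.99):
    """Generic EMA teacher update: teacher <- m * teacher + (1 - m) * student.

    Both arguments are dicts mapping parameter names to arrays.
    """
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }


# Collapsed embeddings (all identical) versus well-spread embeddings.
collapsed = np.tile(np.array([[1.0, 0.0]]), (4, 1))
spread = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
print("collapsed:", uniformity(collapsed))
print("spread:   ", uniformity(spread))
```

Under this assumed definition, a batch whose image embeddings cluster together scores higher (worse) than a well-spread batch, which is one way to read the abstract's claim that sensor degradation "corrupts" embeddings in terms of uniformity.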