🤖 AI Summary
Vision-language models suffer substantial degradation in zero-shot generalization under distribution shifts, particularly for long-tailed and semantically similar categories, where prototype degradation and inter-class confusion severely impair performance. To address this, the authors propose CPL-NC, a lightweight test-time adaptation framework. Its key contributions are: (1) a class-aware prototype cache whose per-class capacity adapts dynamically, enabling fine-grained category representation; (2) a prototype rejuvenation mechanism that mitigates degradation for rarely activated classes; (3) hard-negative-aware contrastive learning that enhances discriminability; and (4) an asymmetric optimization strategy that updates only textual prototypes. Evaluated across 15 benchmarks with ResNet-50 and ViT-B/16 backbones, CPL-NC consistently outperforms state-of-the-art methods, with the largest gains in generalization robustness on long-tailed and semantically confusable categories.
📝 Abstract
Vision-Language Models (VLMs) demonstrate impressive zero-shot generalization through large-scale image-text pretraining, yet their performance can drop sharply once the deployment distribution diverges from the training distribution. Test-Time Adaptation (TTA) methods address this by updating models with unlabeled target data. However, existing approaches often overlook two key challenges: prototype degradation under long-tailed distributions and confusion between semantically similar classes. To tackle these issues, we propose **C**lass-Aware **P**rototype **L**earning with **N**egative **C**ontrast (**CPL-NC**), a lightweight TTA framework designed specifically for VLMs to enhance generalization under distribution shifts. CPL-NC introduces a *Class-Aware Prototype Cache* module that dynamically adjusts per-class capacity based on test-time frequency and activation history, with a rejuvenation mechanism for inactive classes to retain rare-category knowledge. In addition, a *Negative Contrastive Learning* mechanism identifies and constrains hard visual-textual negatives to improve class separability. The framework employs asymmetric optimization, refining only textual prototypes while anchoring on stable visual features. Experiments on 15 benchmarks show that CPL-NC consistently outperforms prior TTA methods on both ResNet-50 and ViT-B/16 backbones.
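To make the cache component concrete, the following is a minimal illustrative sketch of a class-aware prototype cache with frequency-driven capacity and a rejuvenation fallback, in the spirit of the abstract's description. All names (`ClassAwarePrototypeCache`, the `log1p` capacity rule, the confidence-based eviction) are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np


class ClassAwarePrototypeCache:
    """Illustrative sketch (not the paper's code): a per-class feature
    cache whose capacity grows with test-time frequency, with a
    rejuvenation fallback for classes that were never activated."""

    def __init__(self, num_classes, feat_dim, base_capacity=2, max_capacity=8):
        self.base_capacity = base_capacity
        self.max_capacity = max_capacity
        self.cache = {c: [] for c in range(num_classes)}   # per-class (conf, feature) lists
        self.freq = np.zeros(num_classes, dtype=int)       # test-time activation counts
        self.last_seen = np.zeros(num_classes, dtype=int)  # activation history (step index)
        self.step = 0

    def capacity(self, c):
        # Hypothetical rule: capacity scales with log of observed frequency,
        # so head classes store more exemplars, clipped to a hard maximum.
        return min(self.base_capacity + int(np.log1p(self.freq[c])),
                   self.max_capacity)

    def update(self, c, feature, confidence):
        """Insert a test-time feature for predicted class c, keeping only
        the most confident entries up to the dynamic capacity."""
        self.step += 1
        self.freq[c] += 1
        self.last_seen[c] = self.step
        entries = self.cache[c]
        entries.append((float(confidence), np.asarray(feature, dtype=float)))
        entries.sort(key=lambda e: -e[0])      # most confident first
        del entries[self.capacity(c):]          # evict beyond capacity

    def rejuvenate(self, c, text_prototype):
        """Rare/inactive classes fall back to the (frozen) text prototype,
        so their knowledge is not lost to head classes."""
        if not self.cache[c]:
            self.cache[c] = [(0.0, np.asarray(text_prototype, dtype=float))]

    def prototype(self, c):
        """Return the L2-normalized mean of cached features for class c."""
        feats = np.stack([f for _, f in self.cache[c]])
        p = feats.mean(axis=0)
        return p / np.linalg.norm(p)
```

A usage sketch: stream test features through `update`, call `rejuvenate` for classes that never fired, then read class prototypes via `prototype` for cache-augmented classification.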