🤖 AI Summary
Vision-language models suffer substantial degradation in zero-shot generalization under distribution shifts, particularly for long-tailed and semantically similar categories, where prototype degradation and inter-class confusion severely impair performance. To address this, the authors propose CPL-NC, a lightweight test-time adaptation framework. Its key contributions are: (1) a class-aware prototype cache whose per-class capacity adapts dynamically, enabling fine-grained category representation; (2) a prototype rejuvenation mechanism that mitigates degradation for rarely activated classes; (3) hard-negative-aware contrastive learning that enhances discriminability; and (4) an asymmetric optimization strategy that updates only textual prototypes. Evaluated across 15 benchmarks with ResNet-50 and ViT-B/16 backbones, CPL-NC consistently outperforms state-of-the-art methods, with the largest gains in generalization robustness on long-tailed and semantically confusable categories.
📝 Abstract
Vision-Language Models (VLMs) demonstrate impressive zero-shot generalization through large-scale image-text pretraining, yet their performance can drop sharply once the deployment distribution diverges from the training distribution. Test-Time Adaptation (TTA) methods address this by updating models with unlabeled target data. However, existing approaches often overlook two key challenges: prototype degradation under long-tailed distributions and confusion between semantically similar classes. To tackle these issues, we propose **C**lass-Aware **P**rototype **L**earning with **N**egative **C**ontrast (**CPL-NC**), a lightweight TTA framework designed specifically for VLMs to enhance generalization under distribution shifts. CPL-NC introduces a *Class-Aware Prototype Cache* module that dynamically adjusts per-class capacity based on test-time frequency and activation history, with a rejuvenation mechanism for inactive classes to retain rare-category knowledge. In addition, a *Negative Contrastive Learning* mechanism identifies and constrains hard visual-textual negatives to improve class separability. The framework employs asymmetric optimization, refining only textual prototypes while anchoring on stable visual features. Experiments on 15 benchmarks show that CPL-NC consistently outperforms prior TTA methods on both ResNet-50 and ViT-B/16 backbones.
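To make the cache component concrete, the following is a minimal illustrative sketch of a class-aware prototype cache with frequency-driven capacity and a rejuvenation fallback, in the spirit of the abstract's description. All names (`ClassAwarePrototypeCache`, the `log1p` capacity rule, the confidence-based eviction) are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np


class ClassAwarePrototypeCache:
    """Illustrative sketch (not the paper's code): a per-class feature
    cache whose capacity grows with test-time frequency, with a
    rejuvenation fallback for classes that were never activated."""

    def __init__(self, num_classes, feat_dim, base_capacity=2, max_capacity=8):
        self.base_capacity = base_capacity
        self.max_capacity = max_capacity
        self.cache = {c: [] for c in range(num_classes)}   # per-class (conf, feature) lists
        self.freq = np.zeros(num_classes, dtype=int)       # test-time activation counts
        self.last_seen = np.zeros(num_classes, dtype=int)  # activation history (step index)
        self.step = 0

    def capacity(self, c):
        # Hypothetical rule: capacity scales with log of observed frequency,
        # so head classes store more exemplars, clipped to a hard maximum.
        return min(self.base_capacity + int(np.log1p(self.freq[c])),
                   self.max_capacity)

    def update(self, c, feature, confidence):
        """Insert a test-time feature for predicted class c, keeping only
        the most confident entries up to the dynamic capacity."""
        self.step += 1
        self.freq[c] += 1
        self.last_seen[c] = self.step
        entries = self.cache[c]
        entries.append((float(confidence), np.asarray(feature, dtype=float)))
        entries.sort(key=lambda e: -e[0])      # most confident first
        del entries[self.capacity(c):]          # evict beyond capacity

    def rejuvenate(self, c, text_prototype):
        """Rare/inactive classes fall back to the (frozen) text prototype,
        so their knowledge is not lost to head classes."""
        if not self.cache[c]:
            self.cache[c] = [(0.0, np.asarray(text_prototype, dtype=float))]

    def prototype(self, c):
        """Return the L2-normalized mean of cached features for class c."""
        feats = np.stack([f for _, f in self.cache[c]])
        p = feats.mean(axis=0)
        return p / np.linalg.norm(p)
```

A usage sketch: stream test features through `update`, call `rejuvenate` for classes that never fired, then read class prototypes via `prototype` for cache-augmented classification.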