Class-Aware Prototype Learning with Negative Contrast for Test-Time Adaptation of Vision-Language Models

📅 2025-10-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language models suffer substantial degradation in zero-shot generalization under distribution shifts, particularly for long-tailed and semantically similar categories—where prototype degradation and inter-class confusion severely impair performance. To address this, we propose CPL-NC, a lightweight test-time adaptation framework. Its key contributions are: (1) a dynamic-capacity, class-aware prototype cache enabling fine-grained category representation; (2) a prototype rejuvenation mechanism to mitigate degradation; (3) hard negative-aware contrastive learning to enhance discriminability; and (4) an asymmetric optimization strategy that updates only text prototypes. Evaluated across 15 benchmarks with ResNet-50 and ViT-B/16 backbones, CPL-NC consistently outperforms state-of-the-art methods, achieving significant gains in generalization robustness—especially for long-tailed and semantically confusable categories.

📝 Abstract
Vision-Language Models (VLMs) demonstrate impressive zero-shot generalization through large-scale image-text pretraining, yet their performance can drop once the deployment distribution diverges from the training distribution. To address this, Test-Time Adaptation (TTA) methods update models using unlabeled target data. However, existing approaches often ignore two key challenges: prototype degradation in long-tailed distributions and confusion between semantically similar classes. To tackle these issues, we propose Class-Aware Prototype Learning with Negative Contrast (CPL-NC), a lightweight TTA framework designed specifically for VLMs to enhance generalization under distribution shifts. CPL-NC introduces a Class-Aware Prototype Cache Module that dynamically adjusts per-class capacity based on test-time frequency and activation history, with a rejuvenation mechanism for inactive classes to retain rare-category knowledge. Additionally, a Negative Contrastive Learning Mechanism identifies and constrains hard visual-textual negatives to improve class separability. The framework employs asymmetric optimization, refining only textual prototypes while anchoring on stable visual features. Experiments on 15 benchmarks show that CPL-NC consistently outperforms prior TTA methods across both ResNet-50 and ViT-B/16 backbones.
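The paper's exact algorithm is not given on this page, but the cache described in the abstract — per-class capacity that grows with test-time frequency, plus a rejuvenation rule that preserves rare-class knowledge instead of evicting it — can be sketched as follows. All class names, the capacity formula, and the thresholds here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from collections import defaultdict

class ClassAwarePrototypeCache:
    """Hypothetical sketch of a class-aware prototype cache:
    frequently seen classes get more cache slots, and long-inactive
    (rare) classes are compressed to a single prototype rather than
    discarded."""

    def __init__(self, base_capacity=1, max_capacity=8, inactive_steps=50):
        self.base_capacity = base_capacity
        self.max_capacity = max_capacity
        self.inactive_steps = inactive_steps
        self.features = defaultdict(list)   # class id -> [(confidence, feature)]
        self.counts = defaultdict(int)      # test-time frequency per class
        self.last_seen = {}                 # class id -> last activation step
        self.step = 0

    def capacity(self, cls):
        # Assumed rule: one extra slot per 10 observations, capped.
        return min(self.base_capacity + self.counts[cls] // 10,
                   self.max_capacity)

    def update(self, cls, feat, confidence):
        self.step += 1
        self.counts[cls] += 1
        self.last_seen[cls] = self.step
        cache = self.features[cls]
        cache.append((confidence, np.asarray(feat, dtype=float)))
        # Keep only the most confident entries up to the dynamic capacity.
        cache.sort(key=lambda x: x[0], reverse=True)
        del cache[self.capacity(cls):]

    def prototype(self, cls):
        cache = self.features[cls]
        if not cache:
            return None
        proto = np.stack([f for _, f in cache]).mean(axis=0)
        return proto / np.linalg.norm(proto)

    def rejuvenate(self):
        # Inactive classes collapse to one averaged prototype instead of
        # being evicted, retaining rare-category knowledge.
        for cls, cache in self.features.items():
            if cache and self.step - self.last_seen.get(cls, 0) > self.inactive_steps:
                self.features[cls] = [(1.0, self.prototype(cls))]
```

A class seen 25 times under these assumed settings would hold up to three cached features, while a class unseen for more than `inactive_steps` updates keeps exactly one averaged, unit-norm prototype.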
Problem

Research questions and friction points this paper is trying to address.

Addresses prototype degradation in long-tailed test distributions
Reduces confusion between semantically similar visual classes
Enhances vision-language model adaptation using unlabeled target data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic class-aware prototype cache for rare categories
Negative contrastive learning enhances class separability
Asymmetric optimization refines text prototypes only
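The last two innovations — hard-negative contrastive learning and asymmetric optimization that moves only the textual prototypes — can be illustrated with a single test-time update step. This is a minimal sketch, not the authors' method: the InfoNCE-style loss, the number of hard negatives, and every hyperparameter below are assumptions, and the gradient is written out by hand to keep the example dependency-light.

```python
import numpy as np

def asymmetric_negative_contrast_step(img_feat, text_protos, pseudo_label,
                                      k_hard=2, lr=0.1, tau=0.07):
    """One hypothetical adaptation step: pull the pseudo-labeled class's
    text prototype toward the image feature and push the hardest negative
    prototypes away. Only text prototypes are updated; the visual feature
    acts as a fixed anchor (asymmetric optimization)."""
    img = img_feat / np.linalg.norm(img_feat)
    T = text_protos / np.linalg.norm(text_protos, axis=1, keepdims=True)
    sims = T @ img                                  # cosine similarity per class
    # Hard negatives: the wrong classes most similar to this image.
    order = np.argsort(-sims)
    hard_negs = [c for c in order if c != pseudo_label][:k_hard]
    # InfoNCE-style loss over {positive} U {hard negatives}.
    idx = [pseudo_label] + hard_negs
    logits = sims[idx] / tau
    p = np.exp(logits - logits.max())
    p /= p.sum()
    # Manual gradient of -log p[positive] w.r.t. each selected prototype.
    new_T = T.copy()
    for j, c in enumerate(idx):
        grad = (p[j] - (1.0 if j == 0 else 0.0)) / tau * img
        new_T[c] = new_T[c] - lr * grad
    # Renormalize; image features are never touched.
    return new_T / np.linalg.norm(new_T, axis=1, keepdims=True)
```

After one step, the pseudo-labeled prototype's cosine similarity to the image rises while the hard negatives' similarities fall, which is the class-separability effect the bullets above describe.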
Xiaozhen Qiao
School of Information Science and Technology, University of Science and Technology of China, 100 Fuxing Street, Hefei 230026, P. R. China
Jingkai Zhao
Institute of Artificial Intelligence (TeleAI), China Telecom, P. R. China.
Yuqiu Jiang
Institute of Artificial Intelligence (TeleAI), China Telecom, P. R. China.
Xianda Guo
PhD Student at Wuhan University
Stereo Matching, Depth Estimation, Gait Recognition
Zhe Sun
School of Artificial Intelligence, OPtics and ElectroNics (iOPEN), Northwestern Polytechnical University, Xi'an 710072, P. R. China
Hongyuan Zhang
Institute of Artificial Intelligence (TeleAI), China Telecom, P. R. China.
Xuelong Li
Institute of Artificial Intelligence (TeleAI), China Telecom, P. R. China.