Prototype-Based Test-Time Adaptation of Vision-Language Models

2026-04-23
Citations: 0
Influential: 0

AI Summary
This work addresses the high inference latency and limited accuracy of existing backpropagation-free test-time adaptation methods that rely on caches. The authors propose Prototype-Based Test-Time Adaptation (PTA), which replaces the cache with a set of updatable class-specific prototypes: each test sample's visual features are folded into the prototypes, weighted by the sample's zero-shot class confidence. The prototypes are combined with vision-language models such as CLIP, so adaptation requires no backpropagation. The method reports state-of-the-art results on 15 image recognition benchmarks and 4 point cloud benchmarks. It raises CLIP's accuracy on 10 cross-domain benchmarks from 65.64% to 69.38% while retaining 92% of CLIP's inference speed on ImageNet-1K.

Abstract
Test-time adaptation (TTA) has emerged as a promising paradigm for vision-language models (VLMs) to bridge the distribution gap between pre-training and test data. Recent works have focused on backpropagation-free TTA methods that rely on cache-based designs, but these introduce two key limitations. First, inference latency increases as the cache grows with the number of classes, leading to inefficiencies in large-scale settings. Second, suboptimal performance occurs when the cache contains insufficient or incorrect samples. In this paper, we present Prototype-Based Test-Time Adaptation (PTA), an efficient and effective TTA paradigm that uses a set of class-specific knowledge prototypes to accumulate knowledge from test samples. Particularly, knowledge prototypes are adaptively weighted based on the zero-shot class confidence of each test sample, incorporating the sample's visual features into the corresponding class-specific prototype. It is worth highlighting that the knowledge from past test samples is integrated and utilized solely in the prototypes, eliminating the overhead of cache population and retrieval that hinders the efficiency of existing TTA methods. This endows PTA with extremely high efficiency while achieving state-of-the-art performance on 15 image recognition benchmarks and 4 robust point cloud analysis benchmarks. For example, PTA improves CLIP's accuracy from 65.64% to 69.38% on 10 cross-domain benchmarks, while retaining 92% of CLIP's inference speed on large-scale ImageNet-1K. In contrast, the cache-based TDA achieves a lower accuracy of 67.97% and operates at only 50% of CLIP's inference speed.
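The prototype idea described in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the class name `PrototypeAdapter`, the parameters `logit_scale` and `alpha`, the running-sum update, and the additive fusion of zero-shot and prototype logits are all assumptions chosen for illustration. The sketch only shows the general pattern of keeping one prototype per class and folding each test feature into the prototypes, weighted by its zero-shot class confidence.

```python
import math


def normalize(v, eps=1e-12):
    """Scale a vector to unit L2 norm; an all-zero vector stays all-zero."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class PrototypeAdapter:
    """Hypothetical backpropagation-free adapter with one prototype per class.

    text_features: one text embedding per class (e.g. from a CLIP text encoder).
    logit_scale, alpha: illustrative hyperparameters, not values from the paper.
    """

    def __init__(self, text_features, logit_scale=100.0, alpha=1.0):
        self.text = [normalize(t) for t in text_features]
        self.logit_scale = logit_scale
        self.alpha = alpha
        dim = len(self.text[0])
        # Memory is fixed at (num_classes x dim), however many samples arrive.
        self.proto_sum = [[0.0] * dim for _ in self.text]

    def predict_and_update(self, image_feature):
        f = normalize(image_feature)
        # Zero-shot logits: similarity between the image and each class text.
        zero_shot = [self.logit_scale * dot(t, f) for t in self.text]
        # Prototype logits: similarity to knowledge accumulated so far.
        proto = [self.logit_scale * dot(normalize(p), f) for p in self.proto_sum]
        logits = [z + self.alpha * p for z, p in zip(zero_shot, proto)]
        # Assumed update rule: add the feature to every class prototype,
        # weighted by that class's zero-shot confidence.
        probs = softmax(zero_shot)
        for p, w in zip(self.proto_sum, probs):
            for i, x in enumerate(f):
                p[i] += w * x
        prediction = max(range(len(logits)), key=logits.__getitem__)
        return prediction, logits
```

In this sketch each test sample costs one extra pass over the class prototypes and nothing is stored per sample, which is the property the abstract contrasts with cache population and retrieval. The first prediction equals the zero-shot prediction, because all prototypes start at zero.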
Problem

Research questions and friction points this paper is trying to address.

test-time adaptation
vision-language models
cache-based methods
inference latency
distribution shift
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prototype-Based Adaptation
Test-Time Adaptation
Vision-Language Models
Zero-Shot Learning
Efficient Inference
Zhaohong Huang
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China
Yuxin Zhang
Xiamen University
Network sparsity, Model compression
Wenjing Liu
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China
Fei Chao
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China
Rongrong Ji
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China