DOTA: Distributional Test-Time Adaptation of Vision-Language Models

📅 2024-09-28
🏛️ arXiv.org
📈 Citations: 9
Influential: 2
🤖 AI Summary
Vision-language models (e.g., CLIP) exhibit poor robustness under test-time distribution shifts, and existing cache-based test-time adaptation methods suffer from fixed memory capacity and catastrophic forgetting. This paper proposes a training-free, dynamic online adaptation framework that eliminates sample caching and instead continuously estimates the category posterior distribution over the incoming test stream. We introduce a novel Bayesian inference-based distribution modeling mechanism for real-time probabilistic calibration. Furthermore, we propose an uncertainty-driven human-in-the-loop paradigm that actively identifies high-entropy samples and incorporates human feedback to refine posterior estimates. By unifying distribution estimation, Bayesian inference, and adaptive feedback integration, our method significantly outperforms state-of-the-art approaches across multiple distribution shift benchmarks—improving CLIP’s average accuracy by 5.2% and, for the first time, endowing it with continual online learning capability.

📝 Abstract
Vision-language foundation models (e.g., CLIP) have shown remarkable performance across a wide range of tasks. However, deploying these models may be unreliable when significant distribution gaps exist between the training and test data. The training-free test-time dynamic adapter (TDA) is a promising approach to address this issue by storing representative test samples to guide the classification of subsequent ones. However, TDA only naively maintains a limited number of reference samples in the cache, leading to severe test-time catastrophic forgetting when the cache is updated by dropping samples. In this paper, we propose a simple yet effective method for DistributiOnal Test-time Adaptation (Dota). Instead of naively memorizing representative test samples, Dota continually estimates the distributions of test samples, allowing the model to continually adapt to the deployment environment. The test-time posterior probabilities are then computed from the estimated distributions via Bayes' theorem for adaptation purposes. To further enhance adaptability to uncertain samples, we introduce a new human-in-the-loop paradigm which identifies uncertain samples, collects human feedback, and incorporates it into the Dota framework. Extensive experiments validate that Dota enables CLIP to continually learn, resulting in a significant improvement over current state-of-the-art methods.
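The distribution-centric idea can be illustrated with a minimal sketch, which is not the paper's actual implementation: maintain running per-class Gaussian estimates of test-sample features (Welford-style online updates), then combine the class-conditional log-likelihoods with a prior via Bayes' theorem. All class/shape names here are illustrative assumptions.

```python
import numpy as np

class OnlineGaussianAdapter:
    """Illustrative sketch: running per-class diagonal-Gaussian estimates
    over a test stream, with a Bayes-rule posterior. Not the Dota method
    itself; names and shapes are assumptions for demonstration."""

    def __init__(self, num_classes, dim):
        self.counts = np.zeros(num_classes)
        self.means = np.zeros((num_classes, dim))
        self.vars = np.ones((num_classes, dim))  # diagonal covariance

    def update(self, feat, label):
        """Welford-style online mean/variance update for class `label`."""
        self.counts[label] += 1
        n = self.counts[label]
        delta = feat - self.means[label]
        self.means[label] += delta / n
        delta2 = feat - self.means[label]
        self.vars[label] += (delta * delta2 - self.vars[label]) / n

    def posterior(self, feat, prior):
        """Bayes' theorem: p(y|x) ∝ p(x|y) p(y), in log space for stability."""
        v = self.vars + 1e-6  # floor to avoid division by zero
        log_lik = -0.5 * np.sum((feat - self.means) ** 2 / v + np.log(v), axis=1)
        logits = log_lik + np.log(prior + 1e-12)
        logits -= logits.max()
        p = np.exp(logits)
        return p / p.sum()
```

Because the estimates summarize every sample seen so far rather than a fixed-size cache, no sample ever has to be dropped, which is the intuition behind avoiding cache-based forgetting.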
Problem

Research questions and friction points this paper is trying to address.

Addressing unreliable deployment of vision-language models under distribution shifts
Mitigating catastrophic forgetting in cache-based test-time adaptation methods
Enabling continuous adaptation to test data streams without costly fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Estimates the underlying distribution of the test-data stream instead of caching individual samples
Computes test-time posterior probabilities via Bayes' theorem using the estimated distributions
Introduces an uncertainty-driven human-in-the-loop mechanism that collects feedback on uncertain samples
Enables continual online learning through distribution-centric adaptation
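The uncertainty-driven feedback step can be sketched as entropy-based sample selection, an illustrative approximation rather than the paper's exact criterion: predictions whose entropy exceeds a threshold are routed to a human annotator. The threshold value below is an assumption.

```python
import numpy as np

def predictive_entropy(probs):
    """Shannon entropy of a predictive distribution (natural log)."""
    p = np.clip(probs, 1e-12, 1.0)
    return float(-np.sum(p * np.log(p)))

def needs_human_feedback(probs, threshold=0.5):
    """Flag a high-entropy (uncertain) prediction for human review.
    The 0.5 threshold is an illustrative assumption, not from the paper."""
    return predictive_entropy(probs) > threshold
```

For example, a near-uniform prediction over two classes has entropy ln 2 ≈ 0.69 and would be flagged, while a confident prediction like (0.99, 0.01) would not.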
Zongbo Han
Assistant Professor, BUPT; TJU
Machine Learning
Jialong Yang
College of Intelligence and Computing, Tianjin University
Junfan Li
School of Computer Science and Technology, Harbin Institute of Technology Shenzhen
Qinghua Hu
Professor of Computer Science, Tianjin University
Machine Learning, Data Mining
Qianli Xu
Scientist (Institute for Infocomm Research)
Visual Intelligence, Human-Computer Interaction
Mike Zheng Shou
Show Lab, National University of Singapore
Changqing Zhang
Professor, Tianjin University
Machine Learning, Multimodal Learning, LLM