DOTA: Distributional Test-Time Adaptation of Vision-Language Models

📅 2024-09-28
🏛️ arXiv.org
📈 Citations: 9
Influential: 2
🤖 AI Summary
Vision-language models (e.g., CLIP) exhibit poor robustness under test-time distribution shifts, and existing cache-based test-time adaptation methods suffer from fixed memory capacity and catastrophic forgetting. This paper proposes a training-free, dynamic online adaptation framework that eliminates sample caching and instead continuously estimates the category posterior distribution over the incoming test stream. We introduce a novel Bayesian inference-based distribution modeling mechanism for real-time probabilistic calibration. Furthermore, we propose an uncertainty-driven human-in-the-loop paradigm that actively identifies high-entropy samples and incorporates human feedback to refine posterior estimates. By unifying distribution estimation, Bayesian inference, and adaptive feedback integration, our method significantly outperforms state-of-the-art approaches across multiple distribution shift benchmarks—improving CLIP’s average accuracy by 5.2% and, for the first time, endowing it with continual online learning capability.

📝 Abstract
Vision-language foundation models (e.g., CLIP) have shown remarkable performance across a wide range of tasks. However, deploying these models may be unreliable when significant distribution gaps exist between the training and test data. The training-free test-time dynamic adapter (TDA) is a promising approach to address this issue by storing representative test samples to guide the classification of subsequent ones. However, TDA only naively maintains a limited number of reference samples in the cache, leading to severe test-time catastrophic forgetting when the cache is updated by dropping samples. In this paper, we propose a simple yet effective method for DistributiOnal Test-time Adaptation (Dota). Instead of naively memorizing representative test samples, Dota continually estimates the distributions of test samples, allowing the model to continually adapt to the deployment environment. The test-time posterior probabilities are then computed from the estimated distributions via Bayes' theorem for adaptation purposes. To further enhance adaptability to uncertain samples, we introduce a new human-in-the-loop paradigm which identifies uncertain samples, collects human feedback, and incorporates it into the Dota framework. Extensive experiments validate that Dota enables CLIP to continually learn, resulting in a significant improvement over current state-of-the-art methods.
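The distribution-centric idea can be illustrated with a minimal sketch, which is not the paper's actual implementation: maintain running per-class Gaussian estimates of test-sample features (Welford-style online updates), then combine the class-conditional log-likelihoods with a prior via Bayes' theorem. All class/shape names here are illustrative assumptions.

```python
import numpy as np

class OnlineGaussianAdapter:
    """Illustrative sketch: running per-class diagonal-Gaussian estimates
    over a test stream, with a Bayes-rule posterior. Not the Dota method
    itself; names and shapes are assumptions for demonstration."""

    def __init__(self, num_classes, dim):
        self.counts = np.zeros(num_classes)
        self.means = np.zeros((num_classes, dim))
        self.vars = np.ones((num_classes, dim))  # diagonal covariance

    def update(self, feat, label):
        """Welford-style online mean/variance update for class `label`."""
        self.counts[label] += 1
        n = self.counts[label]
        delta = feat - self.means[label]
        self.means[label] += delta / n
        delta2 = feat - self.means[label]
        self.vars[label] += (delta * delta2 - self.vars[label]) / n

    def posterior(self, feat, prior):
        """Bayes' theorem: p(y|x) ∝ p(x|y) p(y), in log space for stability."""
        v = self.vars + 1e-6  # floor to avoid division by zero
        log_lik = -0.5 * np.sum((feat - self.means) ** 2 / v + np.log(v), axis=1)
        logits = log_lik + np.log(prior + 1e-12)
        logits -= logits.max()
        p = np.exp(logits)
        return p / p.sum()
```

Because the estimates summarize every sample seen so far rather than a fixed-size cache, no sample ever has to be dropped, which is the intuition behind avoiding cache-based forgetting.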
Problem

Research questions and friction points this paper is trying to address.

Addressing unreliable deployment of vision-language models under distribution shifts
Mitigating catastrophic forgetting in cache-based test-time adaptation methods
Enabling continuous adaptation to test data streams without costly fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Estimates the underlying distribution of the test-data stream instead of caching individual samples
Computes test-time posterior probabilities via Bayes' theorem using the estimated distributions
Introduces an uncertainty-driven human-in-the-loop mechanism that collects feedback on uncertain samples
Enables continual online learning through distribution-centric adaptation
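The uncertainty-driven feedback step can be sketched as entropy-based sample selection, an illustrative approximation rather than the paper's exact criterion: predictions whose entropy exceeds a threshold are routed to a human annotator. The threshold value below is an assumption.

```python
import numpy as np

def predictive_entropy(probs):
    """Shannon entropy of a predictive distribution (natural log)."""
    p = np.clip(probs, 1e-12, 1.0)
    return float(-np.sum(p * np.log(p)))

def needs_human_feedback(probs, threshold=0.5):
    """Flag a high-entropy (uncertain) prediction for human review.
    The 0.5 threshold is an illustrative assumption, not from the paper."""
    return predictive_entropy(probs) > threshold
```

For example, a near-uniform prediction over two classes has entropy ln 2 ≈ 0.69 and would be flagged, while a confident prediction like (0.99, 0.01) would not.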
Zongbo Han
Assistant Professor, BUPT; TJU
Machine Learning
Jialong Yang
College of Intelligence and Computing, Tianjin University
Junfan Li
School of Computer Science and Technology, Harbin Institute of Technology Shenzhen
Qinghua Hu
Professor of Computer Science, Tianjin University
Machine Learning, Data Mining
Qianli Xu
Scientist (Institute for Infocomm Research)
Visual Intelligence, Human-Computer Interaction
Mike Zheng Shou
Show Lab, National University of Singapore
Changqing Zhang
Professor, Tianjin University
Machine Learning, Multimodal Learning, LLM