🤖 AI Summary
Vision-language models (e.g., CLIP) exhibit poor robustness under test-time distribution shifts, and existing cache-based test-time adaptation methods suffer from fixed memory capacity and catastrophic forgetting. This paper proposes a training-free, dynamic online adaptation framework that eliminates sample caching and instead continually estimates the distributions of the incoming test stream. We introduce a Bayesian inference-based distribution modeling mechanism that computes class posterior probabilities in real time. Furthermore, we propose an uncertainty-driven human-in-the-loop paradigm that actively identifies high-entropy samples and incorporates human feedback to refine the posterior estimates. By unifying distribution estimation, Bayesian inference, and adaptive feedback integration, our method significantly outperforms state-of-the-art approaches across multiple distribution-shift benchmarks, improving CLIP's average accuracy by 5.2% and, for the first time, endowing it with continual online learning capability.
📝 Abstract
Vision-language foundation models (e.g., CLIP) have shown remarkable performance across a wide range of tasks. However, these models can be unreliable when deployed in settings where a significant distribution gap exists between the training and test data. The training-free test-time dynamic adapter (TDA) is a promising approach to this issue: it stores representative test samples in a cache to guide the classification of subsequent ones. However, TDA naively maintains only a limited number of reference samples, leading to severe test-time catastrophic forgetting when cached samples are dropped during updates. In this paper, we propose a simple yet effective method for DistributiOnal Test-time Adaptation (Dota). Instead of memorizing representative test samples, Dota continually estimates the distributions of test samples, allowing the model to continually adapt to its deployment environment. Test-time posterior probabilities are then computed from the estimated distributions via Bayes' theorem. To further enhance adaptability on uncertain samples, we introduce a new human-in-the-loop paradigm that identifies uncertain samples, collects human feedback, and incorporates it into the Dota framework. Extensive experiments validate that Dota enables CLIP to learn continually, yielding a significant improvement over current state-of-the-art methods.
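The core idea described above can be sketched concretely. The following is a minimal illustrative example, not the authors' exact formulation: each class keeps a running Gaussian estimate (mean and diagonal variance, updated online in Welford style) of the test features assigned to it, posteriors come from Bayes' theorem over those Gaussians, and a simple predictive-entropy score flags uncertain samples for human feedback. The class name `OnlineGaussianAdapter` and all hyperparameters (e.g., the variance smoothing term) are assumptions for the sketch.

```python
import numpy as np

class OnlineGaussianAdapter:
    """Illustrative sketch of distributional test-time adaptation (assumed
    implementation, not the paper's exact method): per-class Gaussians are
    estimated online from test features, and predictions use Bayes' theorem."""

    def __init__(self, num_classes, dim):
        self.counts = np.zeros(num_classes)
        self.means = np.zeros((num_classes, dim))
        self.m2 = np.ones((num_classes, dim))  # running sum of squared deviations

    def update(self, feature, label):
        # Welford-style online update of the chosen class's mean and variance;
        # the label could be a pseudo-label or human feedback on uncertain samples.
        self.counts[label] += 1
        delta = feature - self.means[label]
        self.means[label] += delta / self.counts[label]
        self.m2[label] += delta * (feature - self.means[label])

    def posterior(self, feature):
        # Bayes' theorem with Gaussian class-conditional likelihoods and
        # empirical class priors; 1e-3 smooths variances of rarely seen classes.
        var = self.m2 / np.maximum(self.counts, 1)[:, None] + 1e-3
        log_prior = np.log(np.maximum(self.counts, 1) / max(self.counts.sum(), 1))
        log_lik = -0.5 * np.sum((feature - self.means) ** 2 / var + np.log(var), axis=1)
        logits = log_prior + log_lik
        p = np.exp(logits - logits.max())
        return p / p.sum()

def predictive_entropy(p):
    """High entropy marks an uncertain sample, a candidate for human feedback."""
    return -np.sum(p * np.log(p + 1e-12))
```

In a deployment loop, each test feature would first be classified with `posterior`; low-entropy predictions update the running distributions with their own pseudo-label, while high-entropy ones are routed to a human annotator before updating.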