🤖 AI Summary
Vision-language models like CLIP suffer from performance degradation in zero-shot image classification under distribution shifts, as existing test-time adaptation (TTA) methods only optimize class embeddings (likelihood) while neglecting prior modeling. Method: We propose the first Bayesian-inspired decoupled dual-update framework for TTA, which simultaneously estimates the posterior and updates the class priors online—jointly optimizing them with class embeddings. Contribution/Results: Our approach bridges a critical theoretical and technical gap in test-time prior adaptation. It achieves significant improvements over state-of-the-art methods across multiple distribution shift benchmarks—including corruption, domain, and style shifts—while also accelerating inference and reducing memory overhead. The method requires no access to source data or labels during adaptation, ensuring practical applicability in realistic deployment scenarios.
📝 Abstract
Test-time adaptation with pre-trained vision-language models, such as CLIP, aims to adapt the model to new, potentially out-of-distribution test data. Existing methods compute the similarity between the visual embedding and learnable class embeddings, initialized from text embeddings, for zero-shot image classification. In this work, we first analyze this process through the lens of Bayes' theorem and observe that the final prediction is governed by two core factors: the likelihood and the prior. However, existing methods essentially focus on adapting class embeddings to adjust the likelihood, while ignoring the importance of the prior. To address this gap, we propose a novel approach, Bayesian Class Adaptation (BCA), which not only continuously updates class embeddings to adapt the likelihood, but also uses the posteriors of incoming samples to continuously update the prior for each class embedding. This dual-update mechanism allows the model to better adapt to distribution shifts and achieve higher prediction accuracy. Our method not only surpasses existing approaches in performance but also maintains superior inference speed and memory usage, making it highly efficient and practical for real-world applications.
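The dual-update idea described above can be sketched in a few lines of NumPy. This is a minimal illustration under assumed choices, not the paper's exact algorithm: the class names, the temperature value, and the exponential-moving-average rule for the online prior update are all illustrative assumptions.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

class BayesianClassAdapter:
    """Sketch of a Bayesian dual update: the prediction combines a
    likelihood term (image/class-embedding similarity) with a per-class
    prior that is updated online from test-time posteriors."""

    def __init__(self, class_embeddings, momentum=0.99):
        # class_embeddings: (K, D) class embeddings initialized from text
        self.class_emb = class_embeddings / np.linalg.norm(
            class_embeddings, axis=1, keepdims=True)
        K = class_embeddings.shape[0]
        self.log_prior = np.full(K, -np.log(K))  # uniform prior at start
        self.momentum = momentum                 # assumed EMA momentum

    def predict(self, image_embedding, temperature=0.01):
        # Likelihood term: cosine similarity, scaled by a temperature
        v = image_embedding / np.linalg.norm(image_embedding)
        logits = self.class_emb @ v / temperature
        # Bayes: posterior ∝ likelihood × prior (addition in log space)
        return softmax(logits + self.log_prior)

    def update_prior(self, posterior):
        # Online prior update: EMA over observed posteriors (one assumed
        # instantiation of "using posteriors to update the prior")
        prior = np.exp(self.log_prior)
        prior = self.momentum * prior + (1 - self.momentum) * posterior
        self.log_prior = np.log(prior / prior.sum())

# Toy usage with random embeddings standing in for CLIP features
rng = np.random.default_rng(0)
adapter = BayesianClassAdapter(rng.normal(size=(5, 16)))
x = adapter.class_emb[2] + 0.01 * rng.normal(size=16)  # near class 2
posterior = adapter.predict(x)
adapter.update_prior(posterior)
```

In a full method the class embeddings themselves would also be updated at test time (the likelihood side of the dual update); the sketch isolates only the prior side for clarity.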