🤖 AI Summary
Vision-language models like CLIP suffer from performance degradation in zero-shot image classification under distribution shifts, as existing test-time adaptation (TTA) methods only optimize class embeddings (likelihood) while neglecting prior modeling. Method: We propose the first Bayesian-inspired decoupled dual-update framework for TTA, which simultaneously estimates the posterior and updates the class priors online—jointly optimizing them with class embeddings. Contribution/Results: Our approach bridges a critical theoretical and technical gap in test-time prior adaptation. It achieves significant improvements over state-of-the-art methods across multiple distribution shift benchmarks—including corruption, domain, and style shifts—while also accelerating inference and reducing memory overhead. The method requires no access to source data or labels during adaptation, ensuring practical applicability in realistic deployment scenarios.
📝 Abstract
Test-time adaptation with pre-trained vision-language models, such as CLIP, aims to adapt the model to new, potentially out-of-distribution test data. Existing methods compute the similarity between the visual embedding and learnable class embeddings, initialized from text embeddings, for zero-shot image classification. In this work, we first analyze this process through the lens of Bayes' theorem and observe that the final prediction is governed by two core factors: the likelihood and the prior. However, existing methods essentially focus on adapting class embeddings to adjust the likelihood, while ignoring the importance of the prior. To address this gap, we propose a novel approach, Bayesian Class Adaptation (BCA), which not only continuously updates class embeddings to adapt the likelihood, but also uses the posteriors of incoming samples to continuously update the prior for each class embedding. This dual-update mechanism allows the model to better adapt to distribution shifts and achieve higher prediction accuracy. Our method not only surpasses existing approaches in performance but also maintains superior inference speed and memory usage, making it highly efficient and practical for real-world applications.
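The dual-update idea described above can be sketched in a few lines of NumPy. This is a minimal illustration under assumed choices, not the paper's exact algorithm: the class names, the temperature value, and the exponential-moving-average rule for the online prior update are all illustrative assumptions.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

class BayesianClassAdapter:
    """Sketch of a Bayesian dual update: the prediction combines a
    likelihood term (image/class-embedding similarity) with a per-class
    prior that is updated online from test-time posteriors."""

    def __init__(self, class_embeddings, momentum=0.99):
        # class_embeddings: (K, D) class embeddings initialized from text
        self.class_emb = class_embeddings / np.linalg.norm(
            class_embeddings, axis=1, keepdims=True)
        K = class_embeddings.shape[0]
        self.log_prior = np.full(K, -np.log(K))  # uniform prior at start
        self.momentum = momentum                 # assumed EMA momentum

    def predict(self, image_embedding, temperature=0.01):
        # Likelihood term: cosine similarity, scaled by a temperature
        v = image_embedding / np.linalg.norm(image_embedding)
        logits = self.class_emb @ v / temperature
        # Bayes: posterior ∝ likelihood × prior (addition in log space)
        return softmax(logits + self.log_prior)

    def update_prior(self, posterior):
        # Online prior update: EMA over observed posteriors (one assumed
        # instantiation of "using posteriors to update the prior")
        prior = np.exp(self.log_prior)
        prior = self.momentum * prior + (1 - self.momentum) * posterior
        self.log_prior = np.log(prior / prior.sum())

# Toy usage with random embeddings standing in for CLIP features
rng = np.random.default_rng(0)
adapter = BayesianClassAdapter(rng.normal(size=(5, 16)))
x = adapter.class_emb[2] + 0.01 * rng.normal(size=16)  # near class 2
posterior = adapter.predict(x)
adapter.update_prior(posterior)
```

In a full method the class embeddings themselves would also be updated at test time (the likelihood side of the dual update); the sketch isolates only the prior side for clarity.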