Bayesian Test-time Adaptation for Object Recognition and Detection with Vision-language Models

📅 2025-10-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Vision-language models (e.g., CLIP, Grounding DINO) suffer significant performance degradation under distribution shifts in recognition and detection tasks. To address this, the paper proposes BCA+, a training-free, backpropagation-free test-time adaptation framework. BCA+ introduces a dynamic caching mechanism that models history-guided adaptive priors and feature-similarity likelihoods, integrating them via uncertainty-weighted fusion to calibrate both semantic and spatial context. It also incorporates class-embedding alignment and multi-scale spatial matching to improve cross-task generalization. Evaluated across multiple recognition and detection benchmarks, BCA+ achieves state-of-the-art performance with low latency, improving robustness and real-time applicability under distribution shifts.

📝 Abstract

Vision-language models (VLMs) such as CLIP and Grounding DINO have achieved remarkable success in object recognition and detection. However, their performance often degrades under real-world distribution shifts. Test-time adaptation (TTA) aims to mitigate this issue by adapting models during inference. Existing methods either rely on computationally expensive backpropagation, which hinders real-time deployment, or focus solely on likelihood adaptation, which overlooks the critical role of the prior. Our prior work, Bayesian Class Adaptation (BCA), addressed these shortcomings for object recognition by introducing a training-free framework that incorporates adaptive priors. Building upon this foundation, we now present Bayesian Class Adaptation plus (BCA+), a unified, training-free TTA framework for both object recognition and detection. BCA+ introduces a dynamic cache that adaptively stores and updates class embeddings, spatial scales (for detection), and, crucially, adaptive class priors derived from historical predictions. We formulate adaptation as a Bayesian inference problem, where final predictions are generated by fusing the initial VLM output with a cache-based prediction. This cache-based prediction combines a dynamically updated likelihood (measuring feature and scale similarity) and a prior (reflecting the evolving class distribution). This dual-adaptation mechanism, coupled with uncertainty-guided fusion, enables BCA+ to correct both the model's semantic understanding and its contextual confidence. As a training-free method requiring no backpropagation, BCA+ is highly efficient. Extensive experiments demonstrate that BCA+ achieves state-of-the-art performance on both recognition and detection benchmarks.
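
The Bayesian formulation in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: the function names (`cache_posterior`, `fuse`), the softmax-over-cosine-similarity likelihood, the `temperature` value, and the use of `exp(-entropy)` as the uncertainty weight. The scale-similarity term used for detection is omitted.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    # Shannon entropy of a probability vector (small epsilon avoids log(0)).
    return float(-np.sum(p * np.log(p + 1e-12)))

def cache_posterior(feature, class_embeddings, prior, temperature=0.1):
    """Cache-based prediction: posterior proportional to likelihood * prior.

    Assumption: the likelihood is a softmax over cosine similarities between
    the (unit-norm) image feature and the cached (unit-norm) class embeddings.
    """
    sims = class_embeddings @ feature
    likelihood = softmax(sims / temperature)
    unnormalized = likelihood * prior
    return unnormalized / unnormalized.sum()

def fuse(p_init, p_cache):
    """Uncertainty-guided fusion of the initial VLM output and the cache output.

    Assumption: each prediction is weighted by exp(-entropy), so the more
    confident (lower-entropy) prediction contributes more.
    """
    w_init = np.exp(-entropy(p_init))
    w_cache = np.exp(-entropy(p_cache))
    fused = w_init * p_init + w_cache * p_cache
    return fused / fused.sum()

# Toy usage: three classes with orthogonal cached embeddings.
embeddings = np.eye(3)
feature = np.array([1.0, 1.0, 0.0]) / np.sqrt(2.0)  # equally close to classes 0 and 1
prior = np.array([0.7, 0.2, 0.1])                   # class 0 has been seen most often
p_cache = cache_posterior(feature, embeddings, prior)
p_final = fuse(np.array([0.4, 0.35, 0.25]), p_cache)
```

In this toy case the likelihood cannot separate classes 0 and 1, so the prior breaks the tie in favor of class 0. This is the role the abstract assigns to the adaptive prior, which likelihood-only methods lack.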

Problem

Research questions and friction points this paper is trying to address.

Adapting vision-language models to distribution shifts during inference

Overcoming computational limitations of backpropagation in real-time deployment

Integrating both likelihood adaptation and dynamic prior updates

Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free Bayesian framework for object recognition and detection

Dynamic cache updating embeddings, scales, and adaptive priors

Uncertainty-guided fusion combining initial output with cache predictions
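
As a rough illustration of the dynamic cache, the sketch below keeps one embedding and one count per class and updates both from each test-time prediction, with no gradients involved. The class name `ClassCache`, the running-mean embedding update, the hard (argmax) class assignment, and the count-based prior are all hypothetical choices, not the paper's actual update rules. Spatial scales for detection are left out.

```python
import numpy as np

class ClassCache:
    """Hypothetical per-class cache of embeddings and prediction counts."""

    def __init__(self, text_embeddings):
        emb = np.asarray(text_embeddings, dtype=float).copy()
        # Store unit-norm class embeddings, initialized from the text encoder.
        self.embeddings = emb / np.linalg.norm(emb, axis=1, keepdims=True)
        # One pseudo-count per class, so the initial prior is uniform.
        self.counts = np.ones(emb.shape[0])

    @property
    def prior(self):
        # Adaptive class prior derived from the history of predictions.
        return self.counts / self.counts.sum()

    def update(self, feature, posterior):
        # Assign the sample to its most probable class (an assumed rule).
        k = int(np.argmax(posterior))
        feature = feature / np.linalg.norm(feature)
        n = self.counts[k]
        # Running mean of the class embedding, re-normalized to unit length.
        merged = (n * self.embeddings[k] + feature) / (n + 1.0)
        norm = np.linalg.norm(merged)
        if norm > 0.0:
            self.embeddings[k] = merged / norm
        self.counts[k] += 1.0

# Toy usage: two test samples are both assigned to class 0.
cache = ClassCache(np.eye(3))
cache.update(np.array([1.0, 0.2, 0.0]), np.array([0.8, 0.1, 0.1]))
cache.update(np.array([1.0, 0.0, 0.1]), np.array([0.9, 0.05, 0.05]))
```

After these two updates the prior shifts toward class 0 and its embedding drifts toward the observed features, while the other classes stay unchanged. Because each update is a few vector operations, a cache of this kind is consistent with the training-free, backpropagation-free design the summary describes.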

🔎 Similar Papers

Efficient Open Set Single Image Test Time Adaptation of Vision Language Models