🤖 AI Summary
This work addresses the challenge of online adaptation for vision-language models in dynamic real-world scenarios—specifically under open-set, single-shot, and streaming test conditions. We propose ROSITA, the first framework to establish an open-set, single-image test-time adaptation benchmark. ROSITA introduces a dynamic feature memory bank coupled with a contrastive learning mechanism to jointly optimize reliable sample selection and the decision boundary between known and unknown classes. It further integrates online confidence estimation with a lightweight rejection module to enable real-time, per-sample decisions. Evaluated across multiple real-world benchmarks, ROSITA achieves state-of-the-art performance, attaining both high recognition accuracy (+2.3% mean accuracy) and low inference latency (<50 ms per sample). The method thus establishes a deployable paradigm for robust vision-language understanding in open-world environments.
📝 Abstract
Adapting models to dynamic, real-world environments characterized by shifting data distributions and unseen test scenarios is a critical challenge in deep learning. In this paper, we consider a realistic and challenging Test-Time Adaptation setting, where a model must continuously adapt to test samples that arrive sequentially, one at a time, while distinguishing between known and unknown classes. Current Test-Time Adaptation methods operate under closed-set assumptions or batch processing, differing from the real-world open-set scenarios. We address this limitation by establishing a comprehensive benchmark for {em Open-set Single-image Test-Time Adaptation using Vision-Language Models}. Furthermore, we propose ROSITA, a novel framework that leverages dynamically updated feature banks to identify reliable test samples and employs a contrastive learning objective to improve the separation between known and unknown classes. Our approach effectively adapts models to domain shifts for known classes while rejecting unfamiliar samples. Extensive experiments across diverse real-world benchmarks demonstrate that ROSITA sets a new state-of-the-art in open-set TTA, achieving both strong performance and computational efficiency for real-time deployment. Our code can be found at the project site https://manogna-s.github.io/rosita/