Efficient Open Set Single Image Test Time Adaptation of Vision Language Models

📅 2024-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses online adaptation of vision-language models in dynamic real-world scenarios, specifically under open-set, single-image, streaming test conditions. It establishes a benchmark for open-set, single-image test-time adaptation and proposes ROSITA, a framework that pairs dynamically updated feature banks with a contrastive learning objective to select reliable test samples and to sharpen the decision boundary between known and unknown classes. It further integrates online confidence estimation with a lightweight rejection module to enable real-time, per-sample decisions. Evaluated across multiple real-world benchmarks, ROSITA achieves state-of-the-art performance, attaining both high recognition accuracy (+2.3% mean accuracy) and low inference latency (<50 ms per sample). The method is therefore presented as a deployable approach to robust vision-language understanding in open-world environments.

📝 Abstract

Adapting models to dynamic, real-world environments characterized by shifting data distributions and unseen test scenarios is a critical challenge in deep learning. In this paper, we consider a realistic and challenging Test-Time Adaptation setting, where a model must continuously adapt to test samples that arrive sequentially, one at a time, while distinguishing between known and unknown classes. Current Test-Time Adaptation methods operate under closed-set assumptions or batch processing, differing from the real-world open-set scenarios. We address this limitation by establishing a comprehensive benchmark for Open-set Single-image Test-Time Adaptation using Vision-Language Models. Furthermore, we propose ROSITA, a novel framework that leverages dynamically updated feature banks to identify reliable test samples and employs a contrastive learning objective to improve the separation between known and unknown classes. Our approach effectively adapts models to domain shifts for known classes while rejecting unfamiliar samples. Extensive experiments across diverse real-world benchmarks demonstrate that ROSITA sets a new state-of-the-art in open-set TTA, achieving both strong performance and computational efficiency for real-time deployment. Our code can be found at the project site https://manogna-s.github.io/rosita/
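To make the setting concrete, the following is a minimal sketch of an open-set, single-image decision loop, not ROSITA's actual implementation. It assumes each test image has already been encoded to a feature vector and each known class to a text feature. The names `classify_or_reject`, `threshold`, and `temperature` are hypothetical, and the fixed confidence threshold is a simple stand-in for whatever known/unknown criterion the paper uses.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify_or_reject(image_feat, class_feats, threshold=0.6, temperature=0.1):
    """Return (label, confidence); label is "unknown" when confidence is low.

    The threshold rule is an assumption made for illustration only.
    """
    names = list(class_feats)
    logits = [cosine(image_feat, class_feats[n]) / temperature for n in names]
    probs = softmax(logits)
    best = max(range(len(names)), key=probs.__getitem__)
    if probs[best] < threshold:
        return "unknown", probs[best]
    return names[best], probs[best]

# Toy stream: samples arrive one at a time, as in the single-image setting.
class_feats = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
stream = [[0.9, 0.1], [0.1, 0.95], [0.7, 0.7]]
predictions = [classify_or_reject(x, class_feats)[0] for x in stream]
```

In this toy stream the third sample is equally similar to both classes, so its confidence falls below the threshold and it is rejected as unknown, while the first two are assigned to their nearest class.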

Problem

Research questions and friction points this paper is trying to address.

Adapting models to dynamic real-world environments with shifting data distributions

Distinguishing between known and unknown classes during single-image test-time adaptation

Overcoming limitations of closed-set assumptions in current test-time adaptation methods

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses dynamically updated feature banks

Employs contrastive learning for class separation

Adapts models to domain shifts efficiently
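The two ideas above, a dynamically updated feature bank and a contrastive objective, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: `FeatureBank` is a hypothetical fixed-capacity FIFO store, and `contrastive_loss` is a generic InfoNCE-style loss chosen here for illustration, in which neighbours from one bank act as positives and neighbours from the other bank act as negatives.

```python
import math
from collections import deque

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

class FeatureBank:
    """Hypothetical fixed-size store; the oldest feature is evicted first."""

    def __init__(self, capacity):
        self.items = deque(maxlen=capacity)

    def add(self, feat):
        self.items.append(normalize(feat))

    def nearest(self, feat, k):
        """Return the k stored features most similar to feat."""
        ranked = sorted(self.items, key=lambda f: dot(f, feat), reverse=True)
        return ranked[:k]

    def __len__(self):
        return len(self.items)

def contrastive_loss(feat, positives, negatives, temperature=0.1):
    """InfoNCE-style loss: small when feat is near positives, far from negatives."""
    pos = sum(math.exp(dot(feat, p) / temperature) for p in positives)
    neg = sum(math.exp(dot(feat, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

# Toy banks: one for samples judged known, one for samples judged unknown.
known_bank = FeatureBank(capacity=3)
unknown_bank = FeatureBank(capacity=3)
for f in ([1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.95, 0.05]):
    known_bank.add(f)  # the fourth add evicts the first entry
for f in ([0.0, 1.0], [0.1, 0.9]):
    unknown_bank.add(f)

x = normalize([1.0, 0.1])  # a test feature that resembles the known bank
loss_as_known = contrastive_loss(x, known_bank.nearest(x, 2), unknown_bank.nearest(x, 2))
loss_as_unknown = contrastive_loss(x, unknown_bank.nearest(x, 2), known_bank.nearest(x, 2))
```

For a feature that resembles the known bank, treating it as known gives a much smaller loss than treating it as unknown, which is the separation pressure a contrastive objective of this kind applies.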
