ETTA: Efficient Test-Time Adaptation for Vision-Language Models through Dynamic Embedding Updates

📅 2025-08-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language models (e.g., CLIP) achieve strong zero-shot performance but generalize poorly under distribution shifts. Method: We propose Efficient Test-Time Adaptation (ETTA), a training-free, gradient-free approach that recursively updates contextual embeddings to emulate an unbounded cache, thereby incorporating every incoming test sample for continuous decision-boundary refinement. ETTA also introduces a confidence-driven adaptive prompt-ensembling module that dynamically selects optimal textual prompts per class without any parameter updates, relying solely on the pre-trained model and lightweight caching. Contribution/Results: On multiple distribution-shift benchmarks, ETTA achieves state-of-the-art performance, improving average accuracy by 3.2% over prior methods while incurring negligible computational and memory overhead, demonstrating both high efficiency and strong robustness under distribution shifts.
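The "unbounded cache" emulation described above can be pictured as an incremental mean over all accepted test embeddings, so no samples need to be stored. The sketch below is a minimal illustration under that reading, not the paper's exact update rule; the function name, the confidence gate, and the threshold value are hypothetical.

```python
import numpy as np

def recursive_update(class_embed, count, sample_embed, confidence, threshold=0.5):
    """Hypothetical sketch of ETTA-style recursive embedding updating:
    fold a test sample's embedding into its predicted class embedding
    as a running, confidence-gated mean. This emulates an unbounded
    cache in O(1) memory, since only the mean and a counter are kept."""
    if confidence < threshold:
        # low-confidence samples are left out of the update
        return class_embed, count
    count += 1
    # incremental mean: new = old + (x - old) / n
    class_embed = class_embed + (sample_embed - class_embed) / count
    # re-normalize so the embedding stays usable for cosine-similarity
    # scoring, as in CLIP
    class_embed = class_embed / np.linalg.norm(class_embed)
    return class_embed, count
```

Because the update is a closed-form running mean, each test sample costs one vector addition regardless of how many samples came before, which matches the paper's claim of negligible memory and compute overhead.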

📝 Abstract
Pretrained vision-language models (VLMs) like CLIP show strong zero-shot performance but struggle with generalization under distribution shifts. Test-Time Adaptation (TTA) addresses this by adapting VLMs to unlabeled test data in new domains. While some TTA methods rely on prompt-tuning, training-free cache-based approaches are preferred for efficiency. However, current cache-based TTA models store only a limited set of high-confidence samples, restricting the decision boundary to these samples and ignoring the influence of other incoming test data. To address this, we propose Efficient Test-Time Adaptation (ETTA), introducing a Recursive Updating module that integrates all incoming test samples, progressively refining the decision boundary. This strategy mimics an unbounded cache, dynamically updating contextual embeddings for improved accuracy with minimal memory and computational overhead. ETTA also includes an Adaptive Ensemble module to reduce prompt dependency in image-to-text scores by dynamically selecting optimal prompts for each class. Furthermore, ETTA adaptively combines scores from both modules based on confidence levels, leveraging their complementary strengths. Extensive experiments on two benchmarks confirm that ETTA surpasses the state-of-the-art TTA models in computational complexity and accuracy, setting a new standard for effective, efficient test-time adaptation. The code has been released at https://github.com/hamidreza-dastmalchi/ETTA.
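The Adaptive Ensemble module described in the abstract selects, for each class, the prompt that best matches the current test image rather than averaging all templates uniformly. One plausible minimal sketch, assuming a per-class argmax over prompt-template similarities (the array layout and function name are assumptions, not the paper's implementation):

```python
import numpy as np

def select_prompts(image_embed, prompt_embeds):
    """Hypothetical sketch of per-class adaptive prompt selection.

    image_embed:   (d,) normalized image embedding
    prompt_embeds: (n_prompts, n_classes, d) normalized text embeddings,
                   one per prompt template per class
    Returns the best image-to-text score per class, i.e. each class is
    scored with whichever template fits this image best."""
    sims = prompt_embeds @ image_embed   # (n_prompts, n_classes)
    return sims.max(axis=0)              # best template per class
```

Selecting the maximum per class reduces the dependence of the image-to-text score on any single hand-crafted prompt, which is the stated goal of the module.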
Problem

Research questions and friction points this paper is trying to address.

Improves generalization of VLMs under distribution shifts
Enhances decision boundary with dynamic embedding updates
Reduces prompt dependency via adaptive ensemble module
Innovation

Methods, ideas, or system contributions that make the work stand out.

Recursive Updating module integrates all test samples
Adaptive Ensemble module reduces prompt dependency
Dynamically combines scores based on confidence levels
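The confidence-based combination in the last bullet can be sketched by weighting each module's logits with its own normalized-entropy confidence before summing. This is an illustrative reading, not the paper's exact fusion rule; the weighting scheme and function names are assumptions.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy_confidence(logits):
    """Map prediction entropy to a confidence in [0, 1]:
    1 = a single class dominates, 0 = uniform (maximally uncertain)."""
    p = softmax(logits)
    h = -(p * np.log(p + 1e-12)).sum()
    return 1.0 - h / np.log(len(p))

def fuse_scores(text_logits, cache_logits):
    """Hypothetical confidence-weighted fusion: each module (text-based
    vs. cache-based scoring) contributes in proportion to how confident
    its own prediction is, exploiting their complementary strengths."""
    w_text = entropy_confidence(text_logits)
    w_cache = entropy_confidence(cache_logits)
    return w_text * np.asarray(text_logits) + w_cache * np.asarray(cache_logits)
```

With this scheme, a module producing a near-uniform (high-entropy) prediction is automatically down-weighted, so neither branch has to be trusted unconditionally.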