ETTA: Efficient Test-Time Adaptation for Vision-Language Models through Dynamic Embedding Updates

📅 2025-08-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language models (e.g., CLIP) achieve strong zero-shot performance but generalize poorly under distribution shifts. Method: We propose Efficient Test-Time Adaptation (ETTA), a training-free, gradient-free approach that recursively updates contextual embeddings to emulate an unbounded cache, thereby incorporating every incoming test sample for continuous decision-boundary refinement. ETTA also introduces a confidence-driven adaptive prompt-ensembling module that dynamically selects optimal textual prompts per class without any parameter updates, relying solely on the pre-trained model and lightweight caching. Contribution/Results: On multiple distribution-shift benchmarks, ETTA achieves state-of-the-art performance, improving average accuracy by 3.2% over prior methods while incurring negligible computational and memory overhead, demonstrating both high efficiency and strong robustness under distribution shifts.
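The "unbounded cache" emulation described above can be pictured as an incremental mean over all accepted test embeddings, so no samples need to be stored. The sketch below is a minimal illustration under that reading, not the paper's exact update rule; the function name, the confidence gate, and the threshold value are hypothetical.

```python
import numpy as np

def recursive_update(class_embed, count, sample_embed, confidence, threshold=0.5):
    """Hypothetical sketch of ETTA-style recursive embedding updating:
    fold a test sample's embedding into its predicted class embedding
    as a running, confidence-gated mean. This emulates an unbounded
    cache in O(1) memory, since only the mean and a counter are kept."""
    if confidence < threshold:
        # low-confidence samples are left out of the update
        return class_embed, count
    count += 1
    # incremental mean: new = old + (x - old) / n
    class_embed = class_embed + (sample_embed - class_embed) / count
    # re-normalize so the embedding stays usable for cosine-similarity
    # scoring, as in CLIP
    class_embed = class_embed / np.linalg.norm(class_embed)
    return class_embed, count
```

Because the update is a closed-form running mean, each test sample costs one vector addition regardless of how many samples came before, which matches the paper's claim of negligible memory and compute overhead.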

📝 Abstract
Pretrained vision-language models (VLMs) like CLIP show strong zero-shot performance but struggle with generalization under distribution shifts. Test-Time Adaptation (TTA) addresses this by adapting VLMs to unlabeled test data in new domains. While some TTA methods rely on prompt-tuning, training-free cache-based approaches are preferred for efficiency. However, current cache-based TTA models store only a limited set of high-confidence samples, restricting the decision boundary to these samples and ignoring the influence of other incoming test data. To address this, we propose Efficient Test-Time Adaptation (ETTA), introducing a Recursive Updating module that integrates all incoming test samples, progressively refining the decision boundary. This strategy mimics an unbounded cache, dynamically updating contextual embeddings for improved accuracy with minimal memory and computational overhead. ETTA also includes an Adaptive Ensemble module to reduce prompt dependency in image-to-text scores by dynamically selecting optimal prompts for each class. Furthermore, ETTA adaptively combines scores from both modules based on confidence levels, leveraging their complementary strengths. Extensive experiments on two benchmarks confirm that ETTA surpasses the state-of-the-art TTA models in computational complexity and accuracy, setting a new standard for effective, efficient test-time adaptation. The code has been released at https://github.com/hamidreza-dastmalchi/ETTA.
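The Adaptive Ensemble module described in the abstract selects, for each class, the prompt that best matches the current test image rather than averaging all templates uniformly. One plausible minimal sketch, assuming a per-class argmax over prompt-template similarities (the array layout and function name are assumptions, not the paper's implementation):

```python
import numpy as np

def select_prompts(image_embed, prompt_embeds):
    """Hypothetical sketch of per-class adaptive prompt selection.

    image_embed:   (d,) normalized image embedding
    prompt_embeds: (n_prompts, n_classes, d) normalized text embeddings,
                   one per prompt template per class
    Returns the best image-to-text score per class, i.e. each class is
    scored with whichever template fits this image best."""
    sims = prompt_embeds @ image_embed   # (n_prompts, n_classes)
    return sims.max(axis=0)              # best template per class
```

Selecting the maximum per class reduces the dependence of the image-to-text score on any single hand-crafted prompt, which is the stated goal of the module.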
Problem

Research questions and friction points this paper is trying to address.

Improves generalization of VLMs under distribution shifts
Enhances decision boundary with dynamic embedding updates
Reduces prompt dependency via adaptive ensemble module
Innovation

Methods, ideas, or system contributions that make the work stand out.

Recursive Updating module integrates all test samples
Adaptive Ensemble module reduces prompt dependency
Dynamically combines scores based on confidence levels
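The confidence-based combination in the last bullet can be sketched by weighting each module's logits with its own normalized-entropy confidence before summing. This is an illustrative reading, not the paper's exact fusion rule; the weighting scheme and function names are assumptions.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy_confidence(logits):
    """Map prediction entropy to a confidence in [0, 1]:
    1 = a single class dominates, 0 = uniform (maximally uncertain)."""
    p = softmax(logits)
    h = -(p * np.log(p + 1e-12)).sum()
    return 1.0 - h / np.log(len(p))

def fuse_scores(text_logits, cache_logits):
    """Hypothetical confidence-weighted fusion: each module (text-based
    vs. cache-based scoring) contributes in proportion to how confident
    its own prediction is, exploiting their complementary strengths."""
    w_text = entropy_confidence(text_logits)
    w_cache = entropy_confidence(cache_logits)
    return w_text * np.asarray(text_logits) + w_cache * np.asarray(cache_logits)
```

With this scheme, a module producing a near-uniform (high-entropy) prediction is automatically down-weighted, so neither branch has to be trusted unconditionally.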