🤖 AI Summary
Existing test-time adaptation (TTA) research for vision-language models (VLMs) suffers from inconsistent baseline reproduction, narrow evaluation dimensions, non-uniform experimental protocols, and insufficient analytical depth, all of which undermine fair method comparison. To address this, we introduce TTA-VLM, the first comprehensive TTA benchmark specifically designed for VLMs. It encompasses 15 diverse datasets and supports 8 episodic and 7 online TTA methods, uniquely integrating SigLIP and training-time tuning approaches (e.g., CoOp, MaPLe, TeCoA) within a unified, reproducible framework. We propose a multi-dimensional evaluation suite incorporating robustness, calibration, out-of-distribution detection, and stability metrics. Our analysis reveals that current TTA methods yield only marginal accuracy gains, exhibit weak synergy with pre-trained representations, and often trade model trustworthiness (e.g., calibration, uncertainty estimation) for these modest improvements. TTA-VLM is publicly released to foster rigorous, equitable, and holistic research on trustworthy adaptive VLMs.
📝 Abstract
Test-time adaptation (TTA) methods have gained significant attention for enhancing the performance of vision-language models (VLMs) such as CLIP during inference, without requiring additional labeled data. However, current TTA research generally suffers from major limitations such as inconsistent reproduction of baseline results, limited evaluation metrics, non-uniform experimental settings, and insufficient analysis. These problems hinder fair comparison between TTA methods and obscure their practical strengths and weaknesses. To address these challenges, we introduce TTA-VLM, a comprehensive benchmark for evaluating TTA methods on VLMs. Our benchmark implements 8 episodic TTA and 7 online TTA methods within a unified and reproducible framework, and evaluates them across 15 widely used datasets. Unlike prior studies that focus solely on CLIP, we extend the evaluation to SigLIP, a model trained with a sigmoid loss, and include training-time tuning methods such as CoOp, MaPLe, and TeCoA to assess generality. Beyond classification accuracy, TTA-VLM incorporates additional evaluation metrics, including robustness, calibration, out-of-distribution detection, and stability, enabling a more holistic assessment of TTA methods. Through extensive experiments, we find that 1) existing TTA methods produce limited gains over earlier pioneering work; 2) current TTA methods collaborate poorly with training-time fine-tuning methods; 3) accuracy gains frequently come at the cost of reduced model trustworthiness. We release TTA-VLM to enable fair comparison and comprehensive evaluation of TTA methods for VLMs, and we hope it encourages the community to develop more reliable and generalizable TTA strategies.
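Calibration is one of the trustworthiness metrics the abstract highlights. As a concrete illustration, here is a minimal NumPy sketch of expected calibration error (ECE), a standard calibration measure; the function name and toy data are illustrative, and the benchmark's actual metric implementation is not specified here:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average gap between mean confidence and accuracy,
    computed over equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # half-open bins (lo, hi]; the first bin also includes 0.0
        mask = (confidences > lo) & (confidences <= hi)
        if lo == 0.0:
            mask |= confidences == 0.0
        if mask.any():
            acc = correct[mask].mean()      # empirical accuracy in the bin
            conf = confidences[mask].mean() # mean predicted confidence
            ece += mask.mean() * abs(acc - conf)
    return ece

# Toy example: an overconfident classifier
conf = [0.95, 0.90, 0.85, 0.80, 0.99]
hit  = [1,    0,    1,    0,    1]
print(round(expected_calibration_error(conf, hit), 3))  # -> 0.322
```

A model whose confidence matches its accuracy in every bin attains an ECE of zero; the finding that TTA often degrades calibration corresponds to this gap growing after adaptation.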