Online Gaussian Test-Time Adaptation of Vision-Language Models

📅 2025-01-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing online test-time adaptation (OTTA) methods for vision-language models (VLMs) rely on dataset-specific hyperparameter tuning, which limits their applicability to unseen streaming vision-language tasks. Method: We propose the first hyperparameter-free OTTA framework for VLMs, combining Gaussian modeling of visual features with zero-shot priors in an interpretable maximum a posteriori (MAP) estimation framework, enabling real-time adaptation under distribution shift without manual hyperparameter selection. We further introduce multi-round statistical evaluation and a new metric, Expected Tail Accuracy (ETA), to expose the fragility of current OTTA evaluation protocols. Contributions/Results: Our method achieves state-of-the-art performance on mainstream benchmarks; yields complementary gains when combined with popular few-shot techniques; and improves ETA, the average accuracy over the worst 10% of runs, indicating greater stability and robustness under distribution shift.
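The core idea of fusing Gaussian likelihoods with zero-shot priors under MAP estimation can be sketched as follows. This is an illustrative toy version only, not OGA's actual implementation: the function name, the shared isotropic variance, and the use of class means are assumptions made for brevity (the paper models per-class feature likelihoods and fuses them with the VLM's zero-shot predictions).

```python
import numpy as np

def map_predict(x, class_means, zero_shot_log_priors, var=1.0):
    """Toy MAP classification: Gaussian log-likelihood + zero-shot log-prior.

    x: (d,) visual feature; class_means: (C, d) per-class Gaussian means;
    zero_shot_log_priors: (C,) log-probabilities from the VLM's zero-shot head.
    Assumes a shared isotropic covariance var * I for illustration.
    """
    # Log-likelihood of x under each class Gaussian, up to a shared constant.
    sq_dists = ((class_means - x) ** 2).sum(axis=1)
    log_posterior = -sq_dists / (2.0 * var) + zero_shot_log_priors
    return int(np.argmax(log_posterior))
```

With a uniform prior the prediction reduces to nearest class mean; a strong zero-shot prior can override the likelihood term, which is how the zero-shot model keeps guiding adaptation.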

📝 Abstract
Online test-time adaptation (OTTA) of vision-language models (VLMs) has recently garnered increased attention to take advantage of data observed along a stream to improve future predictions. Unfortunately, existing methods rely on dataset-specific hyperparameters, significantly limiting their adaptability to unseen tasks. In response, we propose Online Gaussian Adaptation (OGA), a novel method that models the likelihoods of visual features using Gaussian distributions and incorporates zero-shot priors into an interpretable Maximum A Posteriori (MAP) estimation framework with fixed hyper-parameters across all datasets. We demonstrate that OGA outperforms state-of-the-art methods on most datasets and runs. Additionally, we show that combining OTTA with popular few-shot techniques (a practical yet overlooked setting in prior research) is highly beneficial. Furthermore, our experimental study reveals that common OTTA evaluation protocols, which average performance over at most three runs per dataset, are inadequate due to the substantial variability observed across runs for all OTTA methods. Therefore, we advocate for more rigorous evaluation practices, including increasing the number of runs and considering additional quantitative metrics, such as our proposed Expected Tail Accuracy (ETA), calculated as the average accuracy in the worst 10% of runs. We hope these contributions will encourage more rigorous and diverse evaluation practices in the OTTA community. Code is available at https://github.com/cfuchs2023/OGA.
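The ETA metric described in the abstract, the average accuracy over the worst 10% of runs, is straightforward to compute. A minimal sketch (the function name and tail-size rounding are my assumptions; the paper may aggregate slightly differently):

```python
import numpy as np

def expected_tail_accuracy(run_accuracies, tail_fraction=0.1):
    """Average accuracy over the worst `tail_fraction` of runs (ETA)."""
    accs = np.sort(np.asarray(run_accuracies, dtype=float))  # ascending
    # Take at least one run so the tail is never empty.
    k = max(1, int(np.ceil(tail_fraction * len(accs))))
    return float(accs[:k].mean())
```

For example, with ten runs and the default 10% tail, ETA is simply the accuracy of the single worst run, which is exactly the kind of variability a mean over three runs would hide.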
Problem

Research questions and friction points this paper is trying to address.

Online Test-Time Adaptation
Vision-Language Models
Hyperparameter Tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Online Gaussian Adaptation
Vision-Language Models
Evaluation Protocols