🤖 AI Summary
Existing vision-language models (VLMs) rely on task-specific parameter fine-tuning under zero-/few-shot settings, hindering test-time adaptation without training—especially for out-of-distribution generalization and fine-grained recognition. This work proposes a fine-tuning-free, dynamic graph-based test-time adaptation method: it constructs a context-aware manifold graph over test samples, jointly leverages contrastive vision-language features and dynamically modeled similarity, and incorporates a feature reweighting mechanism to enable end-to-end label propagation inference. To our knowledge, this is the first approach to generate adaptive graph structures and perform efficient inductive inference at test time—without requiring auxiliary unlabeled support sets. Experiments demonstrate substantial improvements over state-of-the-art zero-/few-shot baselines on fine-grained classification and out-of-distribution generalization tasks, achieving an average accuracy gain of 6.2% and a 47% reduction in inference latency.
📝 Abstract
Vision-language models (VLMs) have revolutionized machine learning by leveraging large pre-trained models to tackle various downstream tasks. Despite improvements in label, training, and data efficiency, many state-of-the-art VLMs still require task-specific hyperparameter tuning and fail to fully exploit test samples. To overcome these challenges, we propose a graph-based approach for label-efficient adaptation and inference. Our method dynamically constructs a graph over text prompts, few-shot examples, and test samples, using label propagation for inference without task-specific tuning. Unlike existing zero-shot label propagation techniques, our approach requires no additional unlabeled support set and effectively leverages the test sample manifold through dynamic graph expansion. We further introduce a context-aware feature re-weighting mechanism to improve task adaptation accuracy. Additionally, our method supports efficient graph expansion, enabling real-time inductive inference. Extensive evaluations on downstream tasks, such as fine-grained categorization and out-of-distribution generalization, demonstrate the effectiveness of our approach.