MetaTPT: Meta Test-time Prompt Tuning for Vision-Language Models

📅 2025-12-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Vision-language models (e.g., CLIP) suffer significant performance degradation under test-time domain shift in zero-shot recognition. To address this, we propose MetaTPT, the first meta test-time prompt tuning framework, built on a bi-level optimization paradigm: an inner loop learns a self-supervised auxiliary task that dynamically generates discriminative, sample-specific augmented views, while an outer loop tunes the prompt parameters by enforcing consistency regularization across these views. Evaluated on multiple domain generalization and cross-dataset zero-shot benchmarks, MetaTPT consistently outperforms existing test-time prompt tuning methods, achieving state-of-the-art performance and substantially improving the robustness of CLIP and other vision-language models for out-of-distribution zero-shot recognition.

📝 Abstract
Vision-language models (VLMs) such as CLIP exhibit strong zero-shot generalization but remain sensitive to domain shifts at test time. Test-time prompt tuning (TPT) mitigates this issue by adapting prompts with fixed augmentations, which may falter in more challenging settings. In this work, we propose Meta Test-Time Prompt Tuning (MetaTPT), a meta-learning framework that learns a self-supervised auxiliary task to guide test-time prompt tuning. The auxiliary task dynamically learns parameterized augmentations for each sample, enabling more expressive transformations that capture essential features in target domains. MetaTPT adopts a dual-loop optimization paradigm: an inner loop learns a self-supervised task that generates informative views, while the outer loop performs prompt tuning by enforcing consistency across these views. By coupling augmentation learning with prompt tuning, MetaTPT improves test-time adaptation under domain shifts. Extensive experiments demonstrate that MetaTPT achieves state-of-the-art performance on domain generalization and cross-dataset benchmarks.
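The dual-loop optimization described in the abstract can be sketched with a toy scalar model. Everything below is an illustrative assumption: a 1-D `tanh` encoder and scalar `prompt`/`aug` parameters stand in for CLIP's prompt vectors and the paper's learned, parameterized image augmentations; the inner loop here ascends a view-spread objective as a proxy for "generating informative views", and the outer loop descends the consistency loss over the prompt.

```python
# Toy sketch of one MetaTPT-style dual-loop test-time step (assumptions:
# scalar "prompt" and "augmentation" parameters and a 1-D tanh encoder
# stand in for CLIP prompts and learned image augmentations).
import math

def encode(x, prompt):
    """Toy stand-in for the prompted model's prediction on input x."""
    return math.tanh(prompt * x)

def views(x, aug):
    """Parameterized augmentation: two perturbed views of the sample."""
    return [x + aug, x - aug]

def view_spread(x, prompt, aug):
    """Prediction disagreement across views (also the consistency loss)."""
    preds = [encode(v, prompt) for v in views(x, aug)]
    mean = sum(preds) / len(preds)
    return sum((p - mean) ** 2 for p in preds)

def grad(f, theta, eps=1e-4):
    """Central finite-difference gradient of f at theta."""
    return (f(theta + eps) - f(theta - eps)) / (2 * eps)

def meta_tpt_step(x, prompt, aug, inner_steps=5, lr=0.1):
    # Inner loop (self-supervised auxiliary task): adapt the augmentation
    # so the views become more informative (here: ascend view spread).
    for _ in range(inner_steps):
        aug += lr * grad(lambda a: view_spread(x, prompt, a), aug)
    # Outer loop (prompt tuning): descend the consistency loss so the
    # prompt's predictions agree across the learned views.
    prompt -= lr * grad(lambda p: view_spread(x, p, aug), prompt)
    return prompt, aug
```

A single call performs one adaptation step for one test sample; the actual framework repeats such per-sample adaptation at test time on CLIP's prompt parameters.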
Problem

Research questions and friction points this paper is trying to address.

Enhances test-time adaptation of vision-language models to domain shifts
Learns dynamic augmentations per sample for more expressive transformations
Improves generalization via meta-learning self-supervised auxiliary tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Meta-learning framework for test-time prompt tuning
Dual-loop optimization with self-supervised auxiliary task
Dynamic parameterized augmentations per sample for domain adaptation
Yuqing Lei
UCAS-Terminus AI Lab, University of Chinese Academy of Sciences
Yingjun Du
University of Amsterdam
Meta-learning · Vision-language model
Yawen Huang
Jarvis Research Center, Tencent Youtu Lab
Xiantong Zhen
United Imaging
Medical Image Analysis · Machine Learning · Computer Vision
Ling Shao
UCAS-Terminus AI Lab, University of Chinese Academy of Sciences