Doubly Debiased Test-Time Prompt Tuning for Vision-Language Models

📅 2025-11-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
In zero-shot prompt tuning, entropy minimization solely on unlabeled test data induces dual biases: model-level overconfidence in incorrect predictions and data-level misalignment between visual and textual modalities. This work is the first to systematically analyze the origins of these biases from both model and data perspectives. We propose a dual-debiasing framework comprising: (1) dynamic knowledge-augmented prompt modulation to enhance the reliability of prompt priors; and (2) confidence-weighted ensemble learning coupled with cross-modal consistency distillation to jointly enforce modality alignment and prediction calibration. Evaluated on 15 benchmarks, our method achieves state-of-the-art performance under both natural distribution shifts and cross-dataset zero-shot settings, significantly outperforming existing approaches. It effectively mitigates overconfidence and visual–textual modality misalignment, demonstrating robust generalization without access to labeled training data.

📝 Abstract
Test-time prompt tuning for vision-language models has demonstrated impressive generalization capabilities under zero-shot settings. However, tuning the learnable prompts solely based on unlabeled test data may induce prompt optimization bias, ultimately leading to suboptimal performance on downstream tasks. In this work, we analyze the underlying causes of prompt optimization bias from both the model and data perspectives. In terms of the model, the entropy minimization objective typically focuses on reducing the entropy of model predictions while overlooking their correctness. This can result in overconfident yet incorrect outputs, thereby compromising the quality of prompt optimization. On the data side, prompts affected by optimization bias can introduce misalignment between visual and textual modalities, which further aggravates the prompt optimization bias. To this end, we propose a Doubly Debiased Test-Time Prompt Tuning method. Specifically, we first introduce a dynamic retrieval-augmented modulation module that retrieves high-confidence knowledge from a dynamic knowledge base using the test image feature as a query, and uses the retrieved knowledge to modulate the predictions. Guided by the refined predictions, we further develop a reliability-aware prompt optimization module that incorporates a confidence-based weighted ensemble and cross-modal consistency distillation to impose regularization constraints during prompt tuning. Extensive experiments across 15 benchmark datasets involving both natural distribution shifts and cross-dataset generalization demonstrate that our method outperforms baselines, validating its effectiveness in mitigating prompt optimization bias.
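The model-side bias described in the abstract follows directly from the entropy-minimization objective: the loss only rewards peaked prediction distributions, so a confidently wrong output is indistinguishable from a confidently right one. A minimal NumPy sketch of this failure mode (illustrative code, not the authors' implementation):

```python
import numpy as np

def softmax(logits):
    """Convert logits to a probability distribution."""
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def prediction_entropy(probs):
    """Shannon entropy of a prediction -- the quantity test-time tuning minimizes."""
    return -(probs * np.log(probs + 1e-12)).sum(axis=-1)

# A peaked prediction has near-zero entropy whether or not it is correct,
# so minimizing entropy alone cannot detect a confidently wrong output.
confident = softmax(np.array([8.0, 0.0, 0.0]))  # peaked on class 0
uncertain = softmax(np.array([1.0, 1.0, 1.0]))  # uniform over 3 classes

print(prediction_entropy(confident))  # near 0 nats
print(prediction_entropy(uncertain))  # log(3), about 1.099 nats
```

This is why the method adds external signals (retrieved high-confidence knowledge, cross-modal consistency) rather than relying on the entropy objective by itself.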
Problem

Research questions and friction points this paper is trying to address.

Mitigating prompt optimization bias in vision-language models during test-time tuning
Addressing overconfident incorrect predictions from entropy minimization objectives
Correcting visual-textual misalignment caused by biased prompt optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic retrieval-augmented modulation using test image queries
Reliability-aware prompt optimization with confidence-based ensemble
Cross-modal consistency distillation for regularization during tuning
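The retrieval-augmented modulation idea above can be sketched as a nearest-neighbor lookup over a knowledge base followed by a convex blend with the model's prediction. All names and the blending weight `alpha` here are illustrative assumptions, not the paper's actual interface:

```python
import numpy as np

def retrieve_and_modulate(query_feat, kb_feats, kb_probs, model_probs, k=3, alpha=0.5):
    """Blend the model's prediction with the label distributions of the k
    most similar knowledge-base entries (cosine similarity)."""
    q = query_feat / np.linalg.norm(query_feat)
    kb = kb_feats / np.linalg.norm(kb_feats, axis=1, keepdims=True)
    sims = kb @ q                            # cosine similarity to each entry
    topk = np.argsort(sims)[-k:]             # indices of the k nearest entries
    retrieved = kb_probs[topk].mean(axis=0)  # average retrieved distribution
    return alpha * model_probs + (1 - alpha) * retrieved

# Toy example: 4 stored entries with 2-D features and 3-class distributions.
rng = np.random.default_rng(0)
kb_feats = rng.normal(size=(4, 2))
kb_probs = rng.dirichlet(np.ones(3), size=4)
refined = retrieve_and_modulate(rng.normal(size=2), kb_feats, kb_probs,
                                model_probs=np.array([0.7, 0.2, 0.1]), k=2)
print(refined, refined.sum())  # still a valid probability distribution
```

Because both inputs to the blend are valid distributions, the refined prediction is too, which makes it usable as a teaching signal for the subsequent reliability-aware prompt optimization.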
Fei Song
National Key Laboratory of Space Integrated Information System, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Yi Li
National Key Laboratory of Space Integrated Information System, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Rui Wang
National Key Laboratory of Space Integrated Information System, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Jiahuan Zhou
Peking University
Computer Vision, Machine Learning, Deep Learning
Changwen Zheng
Institute of Software, Chinese Academy of Sciences
Machine Learning, Computer Simulation
Jiangmeng Li
Institute of Software, Chinese Academy of Sciences
Multi-modal learning, Self-supervised learning, Domain generalization, Causal learning