AI Summary
This work addresses the poor confidence calibration and unreliable prediction uncertainty commonly induced by prompt tuning in vision-language models. To this end, the authors propose a novel calibration framework that jointly optimizes calibration performance and semantic generalization while preserving the geometric structure of CLIP's pretrained embedding space. The method introduces a dual-regularization mechanism built upon the cross-entropy loss, incorporating a mean-variance margin penalty and a textual moment-matching loss to effectively integrate prompt tuning with uncertainty calibration. Extensive experiments across seven prompt-tuning methods and eleven datasets demonstrate that the proposed approach significantly reduces Expected Calibration Error (ECE) and consistently outperforms existing calibration techniques.
Abstract
Prompt tuning of large-scale vision-language models such as CLIP enables efficient task adaptation without updating model weights. However, it often leads to poor confidence calibration and unreliable predictive uncertainty. We address this problem by proposing a calibration framework that enhances predictive reliability while preserving the geometry of the pretrained CLIP embedding space, which is required for robust generalization. Our approach extends the standard cross-entropy loss with two complementary regularizers: (1) a mean-variance margin penalty that stabilizes inter-class logit margins by maximizing their average while minimizing dispersion, mitigating underconfidence and overconfidence spikes; and (2) a text moment-matching loss that aligns the first and second moments of tuned text embeddings with their frozen CLIP counterparts, preserving semantic dispersion crucial for generalization. Through extensive experiments across 7 prompt-tuning methods and 11 diverse datasets, we demonstrate that our approach significantly reduces the Expected Calibration Error (ECE) compared to competitive calibration techniques on both base and novel classes.
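The two regularizers described above can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' implementation: it assumes the inter-class margin is defined as the true-class logit minus the largest competing logit, that dispersion is measured by the variance of those margins, and that moment matching penalizes squared gaps between per-dimension means and variances of tuned versus frozen text embeddings. Function names, the weight `lam`, and these exact definitions are illustrative assumptions.

```python
import numpy as np

def margin_penalty(logits, labels, lam=1.0):
    """Mean-variance margin penalty (illustrative form).

    Encourages a large average inter-class logit margin while
    penalizing its dispersion across the batch.
    """
    n = logits.shape[0]
    true_logit = logits[np.arange(n), labels]
    # Mask out the true class so the max runs over competing classes only.
    masked = logits.copy()
    masked[np.arange(n), labels] = -np.inf
    margins = true_logit - masked.max(axis=1)
    # Maximize the mean margin (negative sign) and minimize its variance.
    return -margins.mean() + lam * margins.var()

def text_moment_matching(tuned, frozen):
    """Align first and second moments of tuned text embeddings
    with their frozen CLIP counterparts (illustrative form)."""
    mean_gap = np.sum((tuned.mean(axis=0) - frozen.mean(axis=0)) ** 2)
    var_gap = np.sum((tuned.var(axis=0) - frozen.var(axis=0)) ** 2)
    return mean_gap + var_gap
```

In training, these terms would be added to the cross-entropy loss with scalar weights chosen on a validation set; the moment-matching term is zero when the tuned embeddings keep the same per-dimension mean and variance as the frozen ones.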