🤖 AI Summary
Vision-language models (VLMs) suffer from miscalibration under test-time prompt tuning: predicted confidence scores align poorly with actual accuracy, undermining model trustworthiness. This work first identifies an intrinsic link between textual feature dispersion and calibration error. To address this, it proposes O-TPT, a label-free approach that requires no fine-tuning of model weights: by enforcing orthogonality constraints on the textual features corresponding to the learnable prompts, it explicitly improves the alignment between prediction confidence and empirical accuracy. The approach is plug-and-play, requiring no architectural modification, and achieves consistent improvements across VLM backbones (e.g., CLIP) and benchmark datasets. It significantly reduces the Expected Calibration Error (ECE) averaged over multiple datasets and backbones, outperforming existing state-of-the-art calibration methods, and surpasses zero-shot calibration performance on fine-grained classification tasks.
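The Expected Calibration Error mentioned above measures the confidence-weighted gap between a model's accuracy and its mean confidence within confidence bins. A minimal NumPy sketch of the standard equal-width-binning estimator (the function name and bin count are illustrative, not taken from the paper's code):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: sum over bins of
    (fraction of samples in bin) * |bin accuracy - bin mean confidence|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Half-open bins (lo, hi]; softmax confidences are strictly > 0.
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = correct[mask].mean()       # empirical accuracy in bin
            conf = confidences[mask].mean()  # mean predicted confidence in bin
            ece += mask.mean() * abs(acc - conf)
    return ece
```

A perfectly calibrated model (e.g., 80% accuracy among predictions made with 0.8 confidence) yields an ECE of 0; an overconfident model contributes the full accuracy-confidence gap.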
📝 Abstract
Test-time prompt tuning for vision-language models (VLMs) is attracting attention because of its ability to learn from unlabeled data without fine-tuning. Although test-time prompt tuning methods for VLMs can boost accuracy, the resulting models tend to be poorly calibrated, which casts doubt on their reliability and trustworthiness; calibrating test-time prompt tuning in VLMs therefore deserves more attention. To this end, we propose a new approach, called O-TPT, that introduces orthogonality constraints on the textual features corresponding to the learnable prompts for calibrating test-time prompt tuning in VLMs. Towards introducing these constraints, we make the following contributions. First, we uncover new insights into the suboptimal calibration of existing methods that rely on textual feature dispersion. Second, we show that imposing a simple orthogonalization of textual features is a more effective way to obtain textual dispersion. We conduct extensive experiments on various datasets with different backbones and baselines. The results indicate that our method consistently outperforms the prior state of the art, significantly reducing the overall average calibration error. Our method also surpasses zero-shot calibration performance on fine-grained classification tasks.
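The orthogonalization idea in the abstract can be illustrated with a common penalty of this kind: measure how far the class-wise text features are from mutually orthogonal via the off-diagonal of their cosine-similarity Gram matrix. This is a hedged sketch of the general technique, not the paper's exact loss; the function name and the Frobenius-norm form are assumptions:

```python
import numpy as np

def orthogonality_penalty(text_features):
    """Penalty that is zero iff the (normalized) class text features are
    mutually orthogonal; minimizing it disperses the features apart.

    text_features: array of shape (num_classes, embed_dim).
    """
    # L2-normalize each class embedding so the Gram matrix holds cosines.
    F = text_features / np.linalg.norm(text_features, axis=1, keepdims=True)
    G = F @ F.T                      # (C, C) cosine-similarity Gram matrix
    I = np.eye(G.shape[0])
    # Squared Frobenius norm of the off-diagonal part (diagonal is 1 after
    # normalization, so G - I keeps only cross-class similarities).
    return np.sum((G - I) ** 2)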