🤖 AI Summary
This work addresses the poor uncertainty calibration of vision-language models under test-time prompt tuning, where overconfidence often undermines reliability. While existing full-orthogonality constraints improve prototype separation, they inadvertently disrupt the proximity of semantically related categories. This study is the first to identify and analyze this detrimental trade-off. To reconcile prototype separation with semantic coherence, the authors propose Semantic Orthogonal Calibration (SoC), a novel method based on Huber regularization that preserves semantic similarity among related classes while enhancing inter-class separability. Extensive experiments across multiple benchmarks demonstrate that the proposed method significantly improves calibration without compromising discriminative capability, achieving competitive accuracy alongside well-calibrated predictions.
📝 Abstract
With the increasing adoption of vision-language models (VLMs) in critical decision-making systems such as healthcare and autonomous driving, the calibration of their uncertainty estimates becomes paramount. Yet, this dimension has been largely underexplored in the VLM test-time prompt-tuning (TPT) literature, which has predominantly focused on improving discriminative performance. Recent state-of-the-art methods advocate enforcing full orthogonality over pairs of text prompt embeddings to enhance separability, and therefore calibration. Nevertheless, as we show theoretically in this work, the gradients induced by full orthogonality constraints strongly push semantically related classes apart, ultimately making the model overconfident. Based on these findings, we propose Semantic Orthogonal Calibration (SoC), a Huber-based regularizer that enforces smooth prototype separation while preserving semantic proximity, thereby improving calibration over prior orthogonality-based approaches. Through a comprehensive empirical validation, we demonstrate that SoC consistently improves calibration while maintaining competitive discriminative capabilities.
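The abstract contrasts full orthogonality (a quadratic penalty driving every pairwise prototype similarity toward zero, with gradients that grow with similarity) against a Huber-based penalty whose gradient saturates, so semantically related classes are not pushed apart as aggressively. The sketch below illustrates this intuition only; the function name, the choice to apply the penalty to pairwise cosine similarities, and the threshold value are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def huber(x, delta=0.1):
    # Standard Huber function: quadratic near zero, linear beyond delta,
    # so its gradient is capped at delta for large |x|.
    ax = np.abs(x)
    return np.where(ax <= delta, 0.5 * x**2, delta * (ax - 0.5 * delta))

def huber_separation_penalty(prototypes, delta=0.1):
    # Illustrative sketch (not the paper's exact SoC objective):
    # penalize pairwise cosine similarities between class prompt
    # embeddings with a Huber loss. A full-orthogonality penalty
    # (0.5 * s**2 per pair) would push high-similarity pairs apart
    # with a gradient proportional to s; the Huber variant bounds
    # that gradient, leaving related classes closer together.
    P = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    S = P @ P.T                          # pairwise cosine similarities
    iu = np.triu_indices(len(P), k=1)    # distinct pairs only
    return float(huber(S[iu], delta).mean())
```

For already-orthogonal prototypes the penalty is zero, and for highly similar pairs it grows only linearly rather than quadratically, which is the calibration-relevant difference the abstract highlights.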