🤖 AI Summary
This work investigates the impact of chain-of-thought (CoT) reasoning on uncertainty quantification (UQ) in vision-language models. While CoT enhances task accuracy, it induces overconfidence by implicitly conditioning final answers on the model's own reasoning trace, thereby undermining the reliability of UQ methods. This study is the first to uncover this mechanism and to systematically evaluate a range of UQ approaches under CoT reasoning. Experimental results demonstrate that mainstream UQ techniques suffer significant performance degradation when applied within CoT frameworks. In contrast, consistency-based UQ methods not only maintain robustness but also exhibit improved reliability as the reasoning chain lengthens. These findings offer a viable pathway toward trustworthy vision-language reasoning in high-stakes applications where calibrated uncertainty estimates are critical.
📝 Abstract
Vision-language models (VLMs) are increasingly deployed in high-stakes settings where reliable uncertainty quantification (UQ) is as important as predictive accuracy. Extended reasoning via chain-of-thought (CoT) prompting or reasoning-trained models has become ubiquitous in modern VLM pipelines, yet its effect on UQ reliability remains poorly understood. We show that reasoning consistently degrades the quality of most uncertainty estimates, even when it improves task accuracy. We identify implicit answer conditioning as the primary mechanism: as reasoning traces converge on a conclusion before the final answer is generated, token probabilities increasingly reflect consistency with the model's own reasoning trace rather than uncertainty about correctness. In effect, the model becomes overconfident in its answer. In contrast, agreement-based consistency remains robust and often improves under reasoning, making it a practical choice for uncertainty estimation in reasoning-enabled VLMs.
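The agreement-based consistency idea can be illustrated with a minimal sketch: sample several reasoning traces at nonzero temperature, keep only the final answers, and use the fraction agreeing with the majority answer as the confidence score. The function name and the simple majority-vote aggregation here are our own illustrative assumptions, not the paper's exact method.

```python
from collections import Counter


def consistency_confidence(answers: list[str]) -> tuple[str, float]:
    """Agreement-based confidence: the fraction of sampled final
    answers that match the majority answer. `answers` is assumed to
    be a list of final-answer strings extracted from independently
    sampled reasoning traces (e.g., temperature > 0)."""
    # Normalize lightly so trivially different surface forms agree.
    counts = Counter(a.strip().lower() for a in answers)
    majority_answer, majority_count = counts.most_common(1)[0]
    return majority_answer, majority_count / len(answers)


# Example: five sampled traces, four of which end in the same answer.
answer, confidence = consistency_confidence(
    ["cat", "cat", "dog", "cat", "Cat"]
)
print(answer, confidence)  # cat 0.8
```

Because this score depends only on agreement across independent samples, not on the token probabilities of any single trace, it is insulated from the implicit answer conditioning that inflates confidence within a single reasoning chain.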