🤖 AI Summary
Current multimodal large language models (MLLMs) suffer from poor calibration in uncertainty quantification (UQ): they frequently assign high confidence to incorrect yet internally consistent responses, resulting in high expected calibration error (ECE). To address this, we propose a UQ calibration framework that jointly leverages vision–text cross-modal grounding consistency and self-consistency, anchoring textual responses to visual inputs and applying temperature scaling to calibrate the noisy confidence estimates of the grounding model. Our method is the first to explicitly incorporate cross-modal consistency into the UQ pipeline, improving the alignment between predicted confidence and true accuracy. Extensive experiments on Slake (medical QA) and VQAv2 (general visual QA), built on the LLaVA-Med and LLaVA architectures, demonstrate that the approach significantly reduces ECE and outperforms existing UQ methods in calibration.
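The calibration metric central to the summary above, expected calibration error, can be sketched as follows. This is a minimal illustration of the standard binned ECE, not code from the paper; the bin count and equal-width binning scheme are common defaults and are assumptions here:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: group predictions by confidence, then take the
    weighted average gap between mean confidence and empirical
    accuracy within each bin. n_bins=10 is an assumed default."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in bin
    return ece
```

For example, a model that answers with confidence 0.9 but is right only half the time has an ECE of about 0.4, which is the overconfidence pattern the summary attributes to self-consistency-only UQ.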
📝 Abstract
We introduce a novel approach to calibrating uncertainty quantification (UQ) for multimodal large language models (MLLMs). Existing state-of-the-art UQ methods rely on consistency among multiple responses generated by the model for an input query under diverse settings. However, these approaches often report high confidence in scenarios where the model is consistently incorrect, leading to confidence that is poorly calibrated with respect to accuracy. To address this, we leverage cross-modal consistency in addition to self-consistency to improve the calibration of multimodal models: we ground the textual responses in the visual inputs, and the confidence from the grounding model is used to calibrate the overall confidence. Because the grounding model introduces its own uncertainty into the pipeline, we apply temperature scaling, a widely used parametric calibration technique, to calibrate the grounding model's confidence in the accuracy of the generated responses. We evaluate the proposed approach across multiple multimodal tasks, including medical question answering (Slake) and visual question answering (VQAv2), using multimodal models such as LLaVA-Med and LLaVA. The experiments demonstrate that the proposed framework achieves significantly improved calibration on both tasks.
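The temperature-scaling step described above can be sketched as follows. This is a hedged illustration, not the paper's implementation: the grounding model's logits and the grid-search fitting procedure are assumptions; only the use of temperature scaling as a parametric calibration of the grounding model's confidence comes from the abstract:

```python
import numpy as np

def temperature_scale(logits, T):
    """Softmax over logits divided by temperature T.
    T > 1 softens (lowers) confidence; T < 1 sharpens it."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fit_temperature(val_logits, val_labels, grid=np.linspace(0.5, 5.0, 91)):
    """Hypothetical 1-D grid search for the T that minimizes negative
    log-likelihood on a held-out validation set (a common way to fit T)."""
    logits = np.asarray(val_logits, dtype=float)
    labels = np.asarray(val_labels, dtype=int)

    def nll(T):
        p = temperature_scale(logits, T)
        return -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()

    return min(grid, key=nll)
```

A single scalar T is fit post hoc, so the grounding model itself is untouched and its argmax predictions are unchanged; only the confidence attached to each response is rescaled before being combined with the self-consistency signal.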