🤖 AI Summary
Multimodal Visual Question Answering (VQA) models are often poorly calibrated and overconfident under out-of-distribution (OOD) conditions, which compromises their reliability. To address this, the work introduces IVON, a variational optimization algorithm, to VQA for the first time. We propose a Bayesian variational inference-based training paradigm that replaces standard AdamW fine-tuning. By explicitly learning a posterior distribution over model parameters, our method captures predictive uncertainty while preserving accuracy. Empirical evaluation shows substantial reliability improvements: Expected Calibration Error (ECE) decreases by over 50% and coverage under a 1% risk constraint increases by 4% compared to the AdamW baseline. Under a challenging test setting in which 50% of examples are OOD, our approach achieves an 8% absolute gain in coverage over the current state of the art, demonstrating significantly enhanced robustness and trustworthiness.
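The summary above reports reductions in Expected Calibration Error (ECE). For readers unfamiliar with the metric, below is a minimal sketch of the standard binned ECE definition (this is a generic illustration, not the paper's code; the function name and toy data are our own):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: group predictions by confidence and average
    the |accuracy - confidence| gap per bin, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight gap by fraction of samples in bin
    return ece

# Toy example: two bins, each with a small accuracy/confidence gap
conf = np.array([0.95, 0.95, 0.55, 0.55])
corr = np.array([1, 1, 1, 0])
ece = expected_calibration_error(conf, corr)  # ≈ 0.05 for this toy data
```

A well-calibrated model drives this value toward zero, which is the sense in which the reported 50% ECE reduction indicates better-aligned confidence estimates.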
📝 Abstract
Despite remarkable progress in multimodal models for Visual Question Answering (VQA), major reliability concerns remain because the models can be overconfident and miscalibrated, especially in out-of-distribution (OOD) settings. Much work has addressed such issues for unimodal models, but little exists for the multimodal case. Here, we address unreliability in multimodal models by proposing a Variational VQA approach. Specifically, instead of fine-tuning vision-language models with AdamW, we employ a recently proposed variational algorithm called IVON, which yields a posterior distribution over model parameters. Through extensive experiments, we show that our approach improves calibration and abstention behavior without sacrificing the accuracy of AdamW. For instance, compared to AdamW fine-tuning, we reduce Expected Calibration Error by more than 50%, and we raise Coverage by 4% over SOTA (at a fixed risk of 1%). In the presence of distribution shifts, the gain is even larger: an 8% improvement in Coverage (at 1% risk) over SOTA when 50% of test cases are OOD. Overall, we present variational learning as a viable option for enhancing the reliability of multimodal models.
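The abstract measures abstention quality as Coverage at a fixed risk: the model answers only its most confident cases and abstains on the rest, and Coverage is the largest fraction of inputs it can answer while keeping the error rate on answered cases at or below the risk budget (1% here). A minimal sketch of this standard selective-prediction metric (generic illustration with hypothetical toy data, not the paper's evaluation code):

```python
import numpy as np

def coverage_at_risk(confidences, correct, max_risk=0.01):
    """Sort predictions by confidence (most confident first) and find the
    largest answered fraction whose cumulative error rate <= max_risk."""
    order = np.argsort(-np.asarray(confidences, dtype=float))
    correct = np.asarray(correct, dtype=float)[order]
    errors = np.cumsum(1.0 - correct)        # errors among answered cases
    covered = np.arange(1, len(correct) + 1)  # number of answered cases
    risk = errors / covered                   # selective risk at each cutoff
    ok = risk <= max_risk
    return covered[ok].max() / len(correct) if ok.any() else 0.0

# Toy example: the third-most-confident prediction is wrong, so the
# model can safely answer only the top two cases at near-zero risk.
cov = coverage_at_risk([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 1], max_risk=0.01)
# cov == 0.5
```

Under this metric, a better-calibrated model ranks its errors lower in confidence and can therefore answer more questions within the same risk budget, which is what the reported 4% and 8% Coverage gains capture.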