🤖 AI Summary
Multimodal Visual Question Answering (VQA) models are often poorly calibrated and overconfident under out-of-distribution (OOD) conditions, which compromises their reliability. To address this, the work introduces IVON, a variational optimization algorithm, to VQA for the first time. We propose a Bayesian variational inference-based training paradigm that replaces standard AdamW fine-tuning. By explicitly learning a posterior distribution over model parameters, our method captures predictive uncertainty while preserving accuracy. Empirical evaluation shows substantial reliability improvements: Expected Calibration Error (ECE) decreases by over 50% and coverage under a 1% risk constraint increases by 4% compared to the AdamW baseline. Under a challenging test setting in which 50% of examples are OOD, our approach achieves an 8% absolute gain in coverage over the current state of the art, demonstrating significantly enhanced robustness and trustworthiness.
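The summary above reports reductions in Expected Calibration Error (ECE). For readers unfamiliar with the metric, below is a minimal sketch of the standard binned ECE definition (this is a generic illustration, not the paper's code; the function name and toy data are our own):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: group predictions by confidence and average
    the |accuracy - confidence| gap per bin, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight gap by fraction of samples in bin
    return ece

# Toy example: two bins, each with a small accuracy/confidence gap
conf = np.array([0.95, 0.95, 0.55, 0.55])
corr = np.array([1, 1, 1, 0])
ece = expected_calibration_error(conf, corr)  # ≈ 0.05 for this toy data
```

A well-calibrated model drives this value toward zero, which is the sense in which the reported 50% ECE reduction indicates better-aligned confidence estimates.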
📝 Abstract
Despite remarkable progress in multimodal models for Visual Question Answering (VQA), major reliability concerns remain because the models can be overconfident and miscalibrated, especially in out-of-distribution (OOD) settings. Much work has addressed such issues for unimodal models, but little exists for the multimodal case. Here, we address unreliability in multimodal models by proposing a Variational VQA approach. Specifically, instead of fine-tuning vision-language models with AdamW, we employ a recently proposed variational algorithm called IVON, which yields a posterior distribution over model parameters. Through extensive experiments, we show that our approach improves calibration and abstention behavior without sacrificing the accuracy of AdamW. For instance, compared to AdamW fine-tuning, we reduce Expected Calibration Error by more than 50%, and we raise Coverage by 4% over SOTA (at a fixed risk of 1%). In the presence of distribution shifts, the gain is even larger: an 8% improvement in Coverage (at 1% risk) over SOTA when 50% of test cases are OOD. Overall, we present variational learning as a viable option for enhancing the reliability of multimodal models.
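The abstract measures abstention quality as Coverage at a fixed risk: the model answers only its most confident cases and abstains on the rest, and Coverage is the largest fraction of inputs it can answer while keeping the error rate on answered cases at or below the risk budget (1% here). A minimal sketch of this standard selective-prediction metric (generic illustration with hypothetical toy data, not the paper's evaluation code):

```python
import numpy as np

def coverage_at_risk(confidences, correct, max_risk=0.01):
    """Sort predictions by confidence (most confident first) and find the
    largest answered fraction whose cumulative error rate <= max_risk."""
    order = np.argsort(-np.asarray(confidences, dtype=float))
    correct = np.asarray(correct, dtype=float)[order]
    errors = np.cumsum(1.0 - correct)        # errors among answered cases
    covered = np.arange(1, len(correct) + 1)  # number of answered cases
    risk = errors / covered                   # selective risk at each cutoff
    ok = risk <= max_risk
    return covered[ok].max() / len(correct) if ok.any() else 0.0

# Toy example: the third-most-confident prediction is wrong, so the
# model can safely answer only the top two cases at near-zero risk.
cov = coverage_at_risk([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 1], max_risk=0.01)
# cov == 0.5
```

Under this metric, a better-calibrated model ranks its errors lower in confidence and can therefore answer more questions within the same risk budget, which is what the reported 4% and 8% Coverage gains capture.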