🤖 AI Summary
Current vision-language models (VLMs) exhibit limited robustness in real-world visual question answering (VQA) due to insufficient commonsense reasoning capabilities. To address this, we propose a three-stage commonsense-enhanced framework: (1) explicit injection of external structured commonsense knowledge; (2) type-aware contextual post-processing; and (3) implicit commonsense modeling via graph neural networks (GNNs). Our approach is the first to synergistically integrate large vision-language models (LVLMs) with GNNs for commonsense fusion—without requiring large-scale pretraining or intricate prompt engineering—thereby unifying explicit and implicit commonsense representation learning with multimodal reasoning. Evaluated on mainstream VQA benchmarks, our method achieves state-of-the-art performance, with particularly significant accuracy gains on commonsense-sensitive questions. Empirical results demonstrate that structured commonsense knowledge provides critical, measurable improvements to multimodal inference capability.
📝 Abstract
Visual Question Answering (VQA) requires reasoning across visual and textual modalities, yet Large Vision-Language Models (LVLMs) often lack integrated commonsense knowledge, limiting their robustness in real-world scenarios. To address this, we introduce MAGIC-VQA, a novel framework that enhances VQA by systematically integrating commonsense knowledge with LVLMs. MAGIC-VQA employs a three-stage process: (1) Explicit Knowledge Integration from external sources, (2) By-Type Post-Processing for contextual refinement, and (3) Implicit Knowledge Augmentation using a Graph Neural Network (GNN) for structured reasoning. The GNN adds depth to structured inference, enabling relational reasoning beyond what LVLMs achieve alone. MAGIC-VQA bridges a key gap by unifying commonsense knowledge with LVLM-driven reasoning, eliminating the need for extensive pre-training or complex prompt tuning. Our framework achieves state-of-the-art performance on benchmark datasets, significantly improving commonsense reasoning in VQA.
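The three-stage process above can be sketched in simplified form. This is a minimal illustrative mock-up, not the authors' implementation: all function names, the toy knowledge base of triples, and the averaging "message passing" standing in for the GNN stage are assumptions made for illustration.

```python
# Hypothetical sketch of a three-stage commonsense pipeline:
# (1) explicit retrieval, (2) by-type filtering, (3) graph-based augmentation.
# All names and data structures are illustrative, not MAGIC-VQA's actual API.

def retrieve_explicit_knowledge(question, kb):
    """Stage 1: pull commonsense triples whose head entity appears in the question."""
    words = set(question.lower().replace("?", "").split())
    return [t for t in kb if t[0] in words]

def post_process_by_type(question, triples):
    """Stage 2: keep only triples whose relation matches the question type
    (a crude 'why' vs. 'what' heuristic, purely for illustration)."""
    qtype = "causal" if question.lower().startswith("why") else "attribute"
    wanted = {"causal": {"Causes", "HasSubevent"},
              "attribute": {"HasProperty", "IsA"}}[qtype]
    return [t for t in triples if t[1] in wanted]

def gnn_augment(triples, rounds=2):
    """Stage 3: toy message passing -- each node averages its score with its
    neighbours', standing in for implicit GNN-based knowledge augmentation."""
    nodes = {n for h, _, t in triples for n in (h, t)}
    neigh = {n: [] for n in nodes}
    for h, _, t in triples:
        neigh[h].append(t)
        neigh[t].append(h)
    score = {n: 1.0 for n in nodes}
    for _ in range(rounds):
        score = {n: (score[n] + sum(score[m] for m in neigh[n]))
                    / (1 + len(neigh[n]))
                 for n in nodes}
    return score

# Toy run of the pipeline on a hypothetical knowledge base.
kb = [("dog", "IsA", "animal"),
      ("dog", "HasProperty", "loyal"),
      ("rain", "Causes", "wet ground")]
triples = retrieve_explicit_knowledge("What is the dog doing?", kb)
filtered = post_process_by_type("What is the dog doing?", triples)
scores = gnn_augment(filtered)
print(filtered)   # the two 'dog' triples survive both stages
print(sorted(scores))
```

In the real framework, the filtered triples and node representations would condition the LVLM's answer generation; here the stages only show how explicit retrieval, type-aware filtering, and graph-based augmentation compose.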