Enhancing Visual Question Answering with Multimodal LLMs via Chain-of-Question Guided Retrieval-Augmented Generation

📅 2026-05-05
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work targets two limitations of open-domain visual question answering: insufficient use of external knowledge and constrained reasoning capability. It proposes a logical prompting strategy, CoVQD, and a retrieval-augmented generation framework, CgRAG. CoVQD combines Chain-of-Thought (CoT) reasoning with Visual Question Decomposition (VQD) to guide multimodal large language models toward structured reasoning and effective knowledge integration. Evaluated on the E-VQA, InfoSeek, and OKVQA benchmarks, the method is reported to outperform existing state-of-the-art techniques in reasoning accuracy and robustness.
📝 Abstract
With advances in multimodal research and deep learning, Multimodal Large Language Models (MLLMs) have emerged as a powerful paradigm for a wide range of multimodal tasks. As a core problem in vision-language research, Visual Question Answering (VQA) has increasingly employed MLLMs to improve performance, particularly in open-domain settings where external knowledge is essential. In this work, we aim to further enhance retrieval-based VQA by more effectively integrating MLLMs with structured reasoning and knowledge acquisition. We introduce a logical prompting strategy that fuses Chain-of-Thought (CoT) reasoning with Visual Question Decomposition (VQD), termed CoVQD, to guide retrieval toward more accurate and relevant knowledge for MLLM inference. Building on this idea, we propose a new framework, CoVQD-guided RAG (CgRAG), which enables MLLMs to access more comprehensive and coherent external knowledge while benefiting from structured visual-text reasoning guidance, thereby improving generalization and reliability in complex cross-domain VQA scenarios. Extensive experiments on E-VQA, InfoSeek, and OKVQA benchmarks demonstrate the effectiveness of the proposed method.
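The abstract describes the pipeline only at a high level: decompose the visual question into reasoning steps, use those steps to guide retrieval, then hand the retrieved knowledge to the MLLM. The toy sketch below illustrates that general flow and is not the authors' implementation. Every name in it (`Passage`, `decompose`, `retrieve`, `build_prompt`) is hypothetical, the decomposition is a fixed template rather than an MLLM call, the image is replaced by a text caption, and retrieval is plain token overlap over an in-memory list.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    """A knowledge-base entry (hypothetical structure, for illustration only)."""
    title: str
    text: str


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it on non-alphanumeric characters."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return set(cleaned.split())


def decompose(question: str, caption: str) -> list[str]:
    """Stand-in for question decomposition.

    Assumption: a real system would prompt an MLLM with the image itself.
    Here a fixed two-step template is used: identify the entity, then answer.
    """
    return [
        f"What entity is shown in the image described as: {caption}?",
        question,
    ]


def retrieve(query: str, corpus: list[Passage], k: int = 1) -> list[Passage]:
    """Toy retriever: rank passages by token overlap with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(
        corpus,
        key=lambda p: len(query_tokens & tokenize(f"{p.title} {p.text}")),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, caption: str, corpus: list[Passage], k: int = 1) -> str:
    """Retrieve once per sub-question, deduplicate, and assemble a prompt."""
    sub_questions = decompose(question, caption)
    evidence: list[Passage] = []
    for sub_question in sub_questions:
        for passage in retrieve(sub_question, corpus, k):
            if passage not in evidence:
                evidence.append(passage)
    steps = "\n".join(f"{i}. {sq}" for i, sq in enumerate(sub_questions, 1))
    facts = "\n".join(f"- {p.title}: {p.text}" for p in evidence)
    return (
        f"Image: {caption}\n"
        f"Reasoning steps:\n{steps}\n"
        f"Retrieved knowledge:\n{facts}\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    demo_corpus = [
        Passage("Eiffel Tower", "The Eiffel Tower in Paris was completed in 1889."),
        Passage("Golden Gate Bridge", "The Golden Gate Bridge opened in 1937 in San Francisco."),
    ]
    print(build_prompt(
        "When was this tower completed?",
        "a tall iron lattice tower in Paris",
        demo_corpus,
    ))
```

Issuing one retrieval query per sub-question, rather than one query for the whole question, is the point the sketch is meant to show: the entity-identification step can surface a passage that the original question, which only says "this tower", would match weakly on its own.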
Problem

Research questions and friction points this paper is trying to address.

Visual Question Answering
Multimodal Large Language Models
Retrieval-Augmented Generation
External Knowledge
Open-domain VQA
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Thought
Visual Question Decomposition
Retrieval-Augmented Generation
Multimodal LLMs
Visual Question Answering
Quanxing Xu
School of Computer Science and Engineering, Macau University of Science and Technology, Macao SAR 999078, China
Ling Zhou
School of Computer Science and Engineering, Macau University of Science and Technology, Macao SAR 999078, China
Xian Zhong
Hubei Key Laboratory of Transportation Internet of Things, School of Computer Science and Artificial Intelligence, Wuhan University of Technology, Wuhan, Hubei 430070, China; and State Key Laboratory of Maritime Technology and Safety, Wuhan University of Technology, Wuhan 430063, China
Xiaohua Huang
The University of Memphis
Cancer Nanomedicine
Rubing Huang
Macau University of Science and Technology
AI for Software Engineering, Software Engineering for AI, Software Testing, AI Applications
Chia-Wen Lin
Department of Electrical Engineering, National Tsing Hua University, Hsinchu 30013, Taiwan