Enhancing Visual Question Answering with Multimodal LLMs via Chain-of-Question Guided Retrieval-Augmented Generation

📅 2026-05-05

🤖 AI Summary

This work targets two limitations of open-domain visual question answering: underuse of external knowledge and limited reasoning ability. It proposes a logical prompting strategy, CoVQD, and a retrieval-augmented generation framework built on it, CgRAG (CoVQD-guided RAG). CoVQD combines Chain-of-Thought (CoT) reasoning with Visual Question Decomposition (VQD) to guide multimodal large language models toward structured reasoning and to steer retrieval toward more accurate and relevant knowledge. On the cross-domain benchmarks E-VQA, InfoSeek, and OKVQA, the method is reported to outperform existing state-of-the-art techniques, improving both reasoning accuracy and robustness.

📝 Abstract

With advances in multimodal research and deep learning, Multimodal Large Language Models (MLLMs) have emerged as a powerful paradigm for a wide range of multimodal tasks. As a core problem in vision-language research, Visual Question Answering (VQA) has increasingly employed MLLMs to improve performance, particularly in open-domain settings where external knowledge is essential. In this work, we aim to further enhance retrieval-based VQA by more effectively integrating MLLMs with structured reasoning and knowledge acquisition. We introduce a logical prompting strategy that fuses Chain-of-Thought (CoT) reasoning with Visual Question Decomposition (VQD), termed CoVQD, to guide retrieval toward more accurate and relevant knowledge for MLLM inference. Building on this idea, we propose a new framework, CoVQD-guided RAG (CgRAG), which enables MLLMs to access more comprehensive and coherent external knowledge while benefiting from structured visual-text reasoning guidance, thereby improving generalization and reliability in complex cross-domain VQA scenarios. Extensive experiments on E-VQA, InfoSeek, and OKVQA benchmarks demonstrate the effectiveness of the proposed method.
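The abstract does not spell out how CoVQD prompts are built or how CgRAG retrieves knowledge, so the following is only a toy sketch of the general pattern it describes: decompose the visual question into ordered sub-questions, use those sub-questions to guide retrieval, then assemble a step-by-step prompt for the model. Every name here (`Passage`, `decompose`, `retrieve`, `build_prompt`) is hypothetical, the decomposition is a fixed template standing in for an MLLM call, and token overlap stands in for a real retriever. None of it is the authors' implementation.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Passage:
    """One entry of a tiny in-memory knowledge base (hypothetical)."""
    title: str
    text: str


def tokenize(text: str) -> set[str]:
    # Lowercased words with basic punctuation stripped; a toy stand-in
    # for whatever encoder a real retriever would use.
    words = (w.strip(".,?!").lower() for w in text.split())
    return {w for w in words if w}


def decompose(question: str, image_caption: str) -> list[str]:
    # Assumption: a real system would ask the MLLM to produce these
    # sub-questions. Here a fixed template mimics "identify the visual
    # entity first, then answer the original question".
    return [
        f"What is shown in the image described as: {image_caption}?",
        question,
    ]


def retrieve(queries: list[str], kb: list[Passage], k: int = 1) -> list[Passage]:
    # Score each passage by its best token overlap with any sub-question,
    # then keep the top-k passages that overlap at all.
    scored = []
    for passage in kb:
        p_tokens = tokenize(passage.title + " " + passage.text)
        score = max(len(tokenize(q) & p_tokens) for q in queries)
        scored.append((score, passage))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for score, p in scored[:k] if score > 0]


def build_prompt(question: str, sub_questions: list[str], passages: list[Passage]) -> str:
    # Chain the sub-questions in order and attach the retrieved knowledge.
    steps = "\n".join(f"{i}. {q}" for i, q in enumerate(sub_questions, 1))
    context = "\n".join(f"- {p.title}: {p.text}" for p in passages)
    return (
        "Answer the sub-questions in order, then the final question.\n"
        f"Sub-questions:\n{steps}\n"
        f"Retrieved knowledge:\n{context}\n"
        f"Final question: {question}"
    )


if __name__ == "__main__":
    kb = [
        Passage("Eiffel Tower", "The Eiffel Tower in Paris was completed in 1889."),
        Passage("Golden Gate Bridge", "The Golden Gate Bridge opened in 1937."),
    ]
    question = "When was this tower completed?"
    subs = decompose(question, "a tall iron tower in Paris")
    print(build_prompt(question, subs, retrieve(subs, kb)))
```

The point of the sketch is the ordering: retrieval is driven by the decomposed sub-questions instead of the raw question alone, which is the role the abstract assigns to CoVQD. In a real system the image itself, not a caption string, would be passed to the model.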

Problem

Research questions and friction points this paper is trying to address.

Visual Question Answering

Multimodal Large Language Models

Retrieval-Augmented Generation

External Knowledge

Open-domain VQA

Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Thought

Visual Question Decomposition

Retrieval-Augmented Generation