A Comprehensive Survey of Knowledge-Based Vision Question Answering Systems: The Lifecycle of Knowledge in Visual Reasoning Task

📅 2025-04-24
🤖 AI Summary
A systematic survey of knowledge-based visual question answering (KB-VQA) remains absent. This paper introduces the first comprehensive lifecycle taxonomy for KB-VQA, structured around a three-stage paradigm—knowledge representation, retrieval, and reasoning—to unify diverse multimodal knowledge integration techniques, including knowledge graphs, embeddings, retrieval-augmented generation (RAG), and large language models. We present the first structured KB-VQA landscape, categorizing 12 mainstream approaches and identifying three persistent bottlenecks: weak noise robustness, difficulty in dynamic knowledge updating, and insufficient explainability in reasoning. Furthermore, we distill six key open challenges and four actionable research directions. This work establishes a theoretical framework and practical roadmap for developing next-generation visual reasoning systems that are trustworthy and adaptive.

📝 Abstract
Knowledge-based Vision Question Answering (KB-VQA) extends general Vision Question Answering (VQA) by requiring not only an understanding of visual and textual inputs but also an extensive range of knowledge, enabling significant advancements across various real-world applications. KB-VQA introduces unique challenges, including the alignment of heterogeneous information from diverse modalities and sources, the retrieval of relevant knowledge from noisy or large-scale repositories, and the execution of complex reasoning to infer answers from the combined context. With the advancement of Large Language Models (LLMs), KB-VQA systems have also undergone a notable transformation, in which LLMs serve as powerful knowledge repositories, retrieval-augmented generators, and strong reasoners. Despite substantial progress, no comprehensive survey currently exists that systematically organizes and reviews existing KB-VQA methods. This survey aims to fill this gap by establishing a structured taxonomy of KB-VQA approaches and categorizing systems into three main stages: knowledge representation, knowledge retrieval, and knowledge reasoning. By exploring various knowledge integration techniques and identifying persistent challenges, this work also outlines promising future research directions, providing a foundation for advancing KB-VQA models and their applications.
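
To make the three stages named in the abstract concrete, here is a minimal toy sketch in Python. It is not a method from the survey. Every name in it (`Fact`, `KNOWLEDGE_BASE`, `retrieve`, `reason`, `answer`) is hypothetical, the facts are invented for illustration, and the `detected_objects` argument stands in for the output of a vision model that is assumed rather than implemented. Retrieval here is plain word overlap, which real systems would replace with dense retrievers, knowledge graphs, or LLMs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """Stage 1, knowledge representation: one (subject, relation, object) triple."""
    subject: str
    relation: str
    obj: str


# Hypothetical toy knowledge base, invented for this illustration.
KNOWLEDGE_BASE = [
    Fact("banana", "is rich in", "potassium"),
    Fact("banana", "grows in", "tropical regions"),
    Fact("fire hydrant", "is used by", "firefighters"),
]


def tokenize(text):
    """Lowercase the text and split it into a set of words."""
    return set(text.lower().replace("?", "").split())


def retrieve(question, detected_objects, kb, top_k=1):
    """Stage 2, knowledge retrieval: rank facts about detected objects
    by word overlap with the question and keep the best matches."""
    query = tokenize(question)
    scored = []
    for fact in kb:
        # Only consider facts whose subject was (hypothetically) seen in the image.
        if fact.subject not in detected_objects:
            continue
        overlap = len(query & tokenize(f"{fact.relation} {fact.obj}"))
        scored.append((overlap, fact))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop facts that share no words with the question.
    return [fact for score, fact in scored[:top_k] if score > 0]


def reason(facts):
    """Stage 3, knowledge reasoning: a trivial rule that returns the object
    of the best retrieved fact, or None when nothing was retrieved."""
    return facts[0].obj if facts else None


def answer(question, detected_objects, kb=KNOWLEDGE_BASE):
    """Run retrieval and reasoning over the represented knowledge."""
    return reason(retrieve(question, detected_objects, kb))


if __name__ == "__main__":
    print(answer("What mineral is this fruit rich in?", ["banana"]))
```

The sketch also hints at the noise-sensitivity issue the summary mentions: with word-overlap scoring, an irrelevant fact that happens to share words with the question can outrank the relevant one.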
Problem

Research questions and friction points this paper is trying to address.

Survey KB-VQA systems for knowledge integration challenges
Address knowledge representation, retrieval, reasoning in KB-VQA
Outline future research directions for KB-VQA advancements
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs as knowledge repositories and reasoners
Structured taxonomy for KB-VQA approaches
Integration of heterogeneous multi-modal information