Retrieval Augmented Generation and Understanding in Vision: A Survey and New Outlook

📅 2025-03-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the limited generalizability and interpretability of vision models that stems from their overreliance on internal parametric knowledge. To tackle this, the authors present a systematic survey of Retrieval-Augmented Generation (RAG) in computer vision and propose the first unified CV-RAG framework, establishing a "cross-modal retrieval–understanding–generation" paradigm that spans both vision understanding (e.g., medical report generation, multimodal question answering) and vision generation (e.g., image, video, and 3D synthesis), while extending to embodied-intelligence tasks such as perception, planning, and interaction. Covering techniques including multimodal retrieval, vision-language joint encoding, external knowledge alignment, retrieval-guided diffusion generation, and embodied environment modeling, the survey comprehensively reviews over 100 works and introduces the first taxonomy for CV-RAG. The analysis identifies three core challenges (knowledge timeliness, cross-modal alignment, and reasoning interpretability) and charts future directions, notably embodied RAG.
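The retrieval step at the core of this paradigm is typically a nearest-neighbor search over a joint vision-language embedding space: an input image is embedded, the most similar entries in an external knowledge base are fetched, and those entries condition the downstream understanding or generation model. A minimal sketch of that retrieval step, using toy 2-D vectors in place of real CLIP-style embeddings (the knowledge-base contents and dimensions here are illustrative assumptions, not from the paper):

```python
import numpy as np

def retrieve(query_emb, kb_embs, kb_texts, k=2):
    """Return the k knowledge entries whose embeddings are most
    cosine-similar to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    kb = kb_embs / np.linalg.norm(kb_embs, axis=1, keepdims=True)
    sims = kb @ q                      # cosine similarity to each entry
    top = np.argsort(-sims)[:k]        # indices of the k best matches
    return [kb_texts[i] for i in top]

# Toy external knowledge base: 2-D stand-ins for joint image-text embeddings.
kb_embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
kb_texts = ["x-ray: normal lungs", "mri: brain scan", "x-ray: fracture"]

query = np.array([0.9, 0.1])           # embedding of an input image
context = retrieve(query, kb_embs, kb_texts)
print(context)
```

The retrieved `context` would then be concatenated with (or cross-attended to) the visual input before generation, which is what distinguishes RAG pipelines from purely parametric vision models.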

📝 Abstract
Retrieval-augmented generation (RAG) has emerged as a pivotal technique in artificial intelligence (AI), particularly in enhancing the capabilities of large language models (LLMs) by enabling access to external, reliable, and up-to-date knowledge sources. In the context of AI-Generated Content (AIGC), RAG has proven invaluable by augmenting model outputs with supplementary, relevant information, thus improving their quality. Recently, the potential of RAG has extended beyond natural language processing, with emerging methods integrating retrieval-augmented strategies into the computer vision (CV) domain. These approaches aim to address the limitations of relying solely on internal model knowledge by incorporating authoritative external knowledge bases, thereby improving both the understanding and generation capabilities of vision models. This survey provides a comprehensive review of the current state of retrieval-augmented techniques in CV, focusing on two main areas: (I) visual understanding and (II) visual generation. In the realm of visual understanding, we systematically review tasks ranging from basic image recognition to complex applications such as medical report generation and multimodal question answering. For visual content generation, we examine the application of RAG in tasks related to image, video, and 3D generation. Furthermore, we explore recent advancements in RAG for embodied AI, with a particular focus on applications in planning, task execution, multimodal perception, interaction, and specialized domains. Given that the integration of retrieval-augmented techniques in CV is still in its early stages, we also highlight the key limitations of current approaches and propose future research directions to drive the development of this promising area.
Problem

Research questions and friction points this paper is trying to address.

Enhancing vision models with external knowledge for better understanding.
Applying retrieval-augmented techniques to visual content generation tasks.
Addressing limitations of current CV methods via RAG integration.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-augmented generation enhances vision models
Integration of external knowledge in computer vision
Survey of RAG in visual understanding and generation