Retrieval-Augmented Visual Question Answering via Built-in Autoregressive Search Engines

📅 2025-02-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the accuracy bottleneck in knowledge-intensive visual question answering (VQA) caused by the decoupling of retrieval and generation modules, this paper proposes ReAuSE: a framework that embeds an autoregressive knowledge retriever directly into a multimodal large language model (MLLM), enabling unified retrieval-generation modeling. Its key contributions are: (1) an end-to-end autoregressive search architecture that retrieves by generating document identifier sequences; (2) a reinforced retrieval calibration module driven by relevance feedback, which aligns retrieval preferences with final answer accuracy; and (3) support for joint retrieval-generation optimization and end-to-end training. Evaluated on OKVQA and A-OKVQA, ReAuSE consistently outperforms strong baselines, with improvements of 2.9–9.6% across all evaluation metrics, while significantly improving retrieval quality and answer reliability.

📝 Abstract
Retrieval-augmented generation (RAG) has emerged to address the knowledge-intensive visual question answering (VQA) task. Current methods mainly employ separate retrieval and generation modules to acquire external knowledge and generate answers, respectively. We propose ReAuSE, an alternative to previous RAG models for the knowledge-based VQA task, which seamlessly integrates a knowledge retriever into the generative multi-modal large language model, serving as a built-in search engine. Specifically, our model functions both as a generative retriever and an accurate answer generator. It not only retrieves documents from the knowledge base by producing an identifier for each document, but also answers visual questions based on the retrieved documents. Furthermore, we propose a reinforced retrieval calibration module based on relevance feedback to improve retrieval performance and align with the preferences for accurate answer generation. Extensive experiments on two representative datasets, OKVQA and A-OKVQA, demonstrate significant improvements ranging from 2.9% to 9.6% across all evaluation metrics when compared to strong baselines.
Problem

Research questions and friction points this paper is trying to address.

Retrieval and generation are decoupled in existing knowledge-based VQA pipelines
Retrieved knowledge often fails to support accurate answer generation
Knowledge-intensive VQA accuracy remains limited
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates the knowledge retriever into the generative MLLM
Functions as a generative retriever by producing document identifiers
Uses a reinforced retrieval calibration module driven by relevance feedback
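The generative-retriever idea above, retrieving by decoding a document identifier rather than scoring embeddings, is typically implemented with constrained decoding: at each step the model may only emit tokens that keep the output a valid identifier prefix. The sketch below illustrates that constraint mechanism with a plain prefix trie; all names are illustrative and not from the paper, which decodes identifiers with an MLLM rather than this toy structure.

```python
# Minimal sketch of trie-constrained generative retrieval: the decoder
# is restricted to token sequences that spell out a valid document ID.

def build_trie(doc_ids):
    """Index each document identifier (a token sequence) in a prefix trie."""
    trie = {}
    for tokens in doc_ids:
        node = trie
        for tok in tokens:
            node = node.setdefault(tok, {})
        node["<eos>"] = {}  # marks a complete identifier
    return trie

def allowed_next_tokens(trie, prefix):
    """Tokens the model may emit after `prefix` so the output stays a
    valid (prefix of a) document identifier; [] if prefix is invalid."""
    node = trie
    for tok in prefix:
        if tok not in node:
            return []
        node = node[tok]
    return sorted(node.keys())

trie = build_trie([["wiki", "cat"], ["wiki", "dog"], ["web", "cat"]])
print(allowed_next_tokens(trie, ["wiki"]))         # ['cat', 'dog']
print(allowed_next_tokens(trie, ["wiki", "cat"]))  # ['<eos>']
```

In practice this function would mask the MLLM's next-token logits at each decoding step (e.g. via a prefix-allowed-tokens hook), so beam search over the constrained vocabulary yields a ranked list of valid document identifiers.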
Xinwei Long, Tsinghua University (natural language processing, multi-modal learning)
Zhiyuan Ma, Department of Electronic Engineering, Tsinghua University
Ermo Hua, Tsinghua University (Physics-driven Foundation Model)
Kaiyan Zhang, Tsinghua University (Foundation Model, Collective Intelligence, Scientific Intelligence)
Biqing Qi, Shanghai Artificial Intelligence Laboratory
Bowen Zhou, Department of Electronic Engineering, Tsinghua University