ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering

πŸ“… 2025-11-27
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address inaccurate external knowledge retrieval and weak reasoning in knowledge-based visual question answering (KB-VQA), this paper proposes a multimodal retrieval-augmented generation (RAG) framework. The method integrates coarse-grained and fine-grained cross-modal retrieval, augmented by a critic model that dynamically filters noisy textual evidence. The authors further design a multi-stage reinforcement learning strategy to strengthen stepwise, evidence-grounded reasoning and context-aware answer generation. Technically, the framework unifies multimodal large language models, hierarchical retrieval, RAG, supervised fine-tuning, and reinforcement learning. Evaluated on Encyclopedic-VQA and InfoSeek, the approach significantly outperforms state-of-the-art methods, achieving substantial gains in answer accuracy. Moreover, generated answers are interpretable and traceable to supporting evidence, improving transparency and reliability in KB-VQA.
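The retrieve-then-critique pipeline the summary describes can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: the coarse retriever, fine-grained reranker, and critic are all stood in for by a simple token-overlap score, where the real system uses cross-modal retrieval models and a learned critic, and the final step would condition an MLLM on the surviving evidence.

```python
# Hypothetical sketch of a ReAG-style retrieve -> rerank -> critique pipeline.
# All component names and the overlap-based scoring are illustrative stand-ins.

def token_overlap(query, passage):
    """Relevance proxy: fraction of query tokens appearing in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0

def coarse_retrieve(query, corpus, k=4):
    """Stage 1: recall-oriented retrieval over the whole corpus."""
    return sorted(corpus, key=lambda d: token_overlap(query, d), reverse=True)[:k]

def fine_rerank(query, candidates, k=2):
    """Stage 2: precision-oriented reranking of the candidate pool.
    The same score stands in for a stronger fine-grained cross-modal scorer."""
    return sorted(candidates, key=lambda d: token_overlap(query, d), reverse=True)[:k]

def critic_filter(query, passages, threshold=0.3):
    """Critic: drop passages whose relevance falls below a threshold,
    so the generator only sees high-quality context."""
    return [p for p in passages if token_overlap(query, p) >= threshold]

def answer_context(query, corpus):
    """Evidence that a real system would feed to the answer generator."""
    return critic_filter(query, fine_rerank(query, coarse_retrieve(query, corpus)))

corpus = [
    "The Eiffel Tower is a wrought-iron lattice tower in Paris",
    "Bananas are rich in potassium",
    "Paris is the capital of France",
]
print(answer_context("what city is the Eiffel Tower in", corpus))
```

With the toy corpus above, the off-topic banana passage is cut at the reranking stage and the weakly related capital-of-France passage is cut by the critic, leaving only the directly relevant evidence.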

πŸ“ Abstract
Multimodal Large Language Models (MLLMs) have shown impressive capabilities in jointly understanding text, images, and videos, often evaluated via Visual Question Answering (VQA). However, even state-of-the-art MLLMs struggle with domain-specific or knowledge-intensive queries, where relevant information is underrepresented in pre-training data. Knowledge-based VQA (KB-VQA) addresses this by retrieving external documents to condition answer generation, but current retrieval-augmented approaches suffer from low precision, noisy passages, and limited reasoning. To address this, we propose ReAG, a novel Reasoning-Augmented Multimodal RAG approach that combines coarse- and fine-grained retrieval with a critic model that filters irrelevant passages, ensuring high-quality additional context. The model follows a multi-stage training strategy leveraging reinforcement learning to enhance reasoning over retrieved content, while supervised fine-tuning serves only as a cold start. Extensive experiments on Encyclopedic-VQA and InfoSeek demonstrate that ReAG significantly outperforms prior methods, improving answer accuracy and providing interpretable reasoning grounded in retrieved evidence. Our source code is publicly available at: https://github.com/aimagelab/ReAG.
Problem

Research questions and friction points this paper is trying to address.

KB-VQA systems answer inaccurately on domain-specific, knowledge-intensive queries
Retrieval-augmented approaches suffer from low precision and noisy retrieved passages
Reasoning over retrieved external knowledge is limited
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines coarse- and fine-grained retrieval with a critic model
Filters irrelevant passages to ensure high-quality context
Uses multi-stage reinforcement learning to enhance reasoning, with supervised fine-tuning as a cold start
πŸ”Ž Similar Papers
No similar papers found.