🤖 AI Summary
Existing multimodal RAG approaches for visually rich documents (VRD) over-rely on salient textual and visual elements, neglecting fine-print knowledge such as small text and contextual details, which leads to incomplete retrieval and inaccurate answer generation. To address this, HKRAG is proposed: a holistic RAG framework that explicitly models both salient and fine-print knowledge via a hybrid masking-based retriever. An uncertainty-guided agentic generator then assesses the uncertainty of an initial answer and dynamically decides how to fuse the two knowledge streams. On open-domain visual question answering benchmarks, HKRAG achieves state-of-the-art performance under both zero-shot and supervised settings, validating that explicit fine-print knowledge modeling is critical for answer completeness and reliability.
📝 Abstract
Existing multimodal Retrieval-Augmented Generation (RAG) methods for visually rich documents (VRD) are often biased towards retrieving salient knowledge (e.g., prominent text and visual elements), while largely neglecting critical fine-print knowledge (e.g., small text, contextual details). This limitation leads to incomplete retrieval and compromises the generator's ability to produce accurate and comprehensive answers. To bridge this gap, we propose HKRAG, a new holistic RAG framework designed to explicitly capture and integrate both knowledge types. Our framework features two key components: (1) a Hybrid Masking-based Holistic Retriever that employs explicit masking strategies to separately model salient and fine-print knowledge, ensuring query-relevant holistic information retrieval; and (2) an Uncertainty-guided Agentic Generator that dynamically assesses the uncertainty of initial answers and actively decides how to integrate the two distinct knowledge streams for optimal response generation. Extensive experiments on open-domain visual question answering benchmarks show that HKRAG consistently outperforms existing methods in both zero-shot and supervised settings, demonstrating the critical importance of holistic knowledge retrieval for VRD understanding.
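To make the two components concrete, here is a minimal, purely illustrative sketch of the dual-stream idea: retrieve once with fine-print masked (salient stream) and once with salient content masked (fine-print stream), then gate fusion on the entropy of a draft answer. All function names, the masking callables, and the entropy threshold `tau` are assumptions for illustration; the paper's actual retriever and generator are not reproduced here.

```python
# Illustrative sketch only — not HKRAG's actual implementation.
# `embed`, `mask_salient`, `mask_fine_print`, and `generate` are
# hypothetical callables supplied by the caller.
import math


def entropy(probs):
    """Shannon entropy of an answer probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def hybrid_retrieve(query, pages, embed, mask_salient, mask_fine_print, top_k=3):
    """Retrieve twice per query: once over pages with fine-print masked
    (exposing salient content) and once with salient regions masked
    (exposing fine-print), keeping both candidate streams."""
    salient_view = [mask_fine_print(p) for p in pages]   # salient stream
    fine_view = [mask_salient(p) for p in pages]         # fine-print stream
    rank = lambda view: sorted(view, key=lambda p: -embed(query, p))[:top_k]
    return rank(salient_view), rank(fine_view)


def agentic_generate(query, salient_docs, fine_docs, generate, tau=1.0):
    """Draft an answer from the salient stream; if the draft's answer
    distribution is uncertain (entropy above tau), regenerate with
    both streams fused."""
    answer, probs = generate(query, salient_docs)
    if entropy(probs) > tau:                             # uncertainty gate
        answer, _ = generate(query, salient_docs + fine_docs)
    return answer
```

The gate is the key design choice: fine-print evidence is only pulled in when the salient-only draft looks unreliable, rather than unconditionally concatenating both streams.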