🤖 AI Summary
Existing multimodal RAG approaches for visually rich documents (VRD) over-rely on salient textual and visual elements, neglecting fine-print knowledge such as small text and contextual details, which leads to incomplete retrieval and inaccurate answer generation. To address this, HKRAG is proposed: a holistic RAG framework that explicitly models both salient and fine-print knowledge via a hybrid masking-based retriever. An uncertainty-guided agentic generator then assesses the uncertainty of an initial answer and dynamically decides how to fuse the two knowledge streams. On open-domain visual question answering benchmarks, HKRAG achieves state-of-the-art performance under both zero-shot and supervised settings, validating that explicit fine-print knowledge modeling is critical for answer completeness and reliability.
📝 Abstract
Existing multimodal Retrieval-Augmented Generation (RAG) methods for visually rich documents (VRD) are often biased towards retrieving salient knowledge (e.g., prominent text and visual elements), while largely neglecting critical fine-print knowledge (e.g., small text, contextual details). This limitation leads to incomplete retrieval and compromises the generator's ability to produce accurate and comprehensive answers. To bridge this gap, we propose HKRAG, a new holistic RAG framework designed to explicitly capture and integrate both knowledge types. Our framework features two key components: (1) a Hybrid Masking-based Holistic Retriever that employs explicit masking strategies to separately model salient and fine-print knowledge, ensuring query-relevant holistic information retrieval; and (2) an Uncertainty-guided Agentic Generator that dynamically assesses the uncertainty of initial answers and actively decides how to integrate the two distinct knowledge streams for optimal response generation. Extensive experiments on open-domain visual question answering benchmarks show that HKRAG consistently outperforms existing methods in both zero-shot and supervised settings, demonstrating the critical importance of holistic knowledge retrieval for VRD understanding.
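To make the two components concrete, here is a minimal, purely illustrative sketch of the dual-stream idea: retrieve once with fine-print masked (salient stream) and once with salient content masked (fine-print stream), then gate fusion on the entropy of a draft answer. All function names, the masking callables, and the entropy threshold `tau` are assumptions for illustration; the paper's actual retriever and generator are not reproduced here.

```python
# Illustrative sketch only — not HKRAG's actual implementation.
# `embed`, `mask_salient`, `mask_fine_print`, and `generate` are
# hypothetical callables supplied by the caller.
import math


def entropy(probs):
    """Shannon entropy of an answer probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def hybrid_retrieve(query, pages, embed, mask_salient, mask_fine_print, top_k=3):
    """Retrieve twice per query: once over pages with fine-print masked
    (exposing salient content) and once with salient regions masked
    (exposing fine-print), keeping both candidate streams."""
    salient_view = [mask_fine_print(p) for p in pages]   # salient stream
    fine_view = [mask_salient(p) for p in pages]         # fine-print stream
    rank = lambda view: sorted(view, key=lambda p: -embed(query, p))[:top_k]
    return rank(salient_view), rank(fine_view)


def agentic_generate(query, salient_docs, fine_docs, generate, tau=1.0):
    """Draft an answer from the salient stream; if the draft's answer
    distribution is uncertain (entropy above tau), regenerate with
    both streams fused."""
    answer, probs = generate(query, salient_docs)
    if entropy(probs) > tau:                             # uncertainty gate
        answer, _ = generate(query, salient_docs + fine_docs)
    return answer
```

The gate is the key design choice: fine-print evidence is only pulled in when the salient-only draft looks unreliable, rather than unconditionally concatenating both streams.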