🤖 AI Summary
Multimodal large language models (MLLMs) suffer from hallucinations—particularly concerning object identity, spatial location, and relational semantics—in fine-grained visual question answering (VQA), primarily due to the absence of explicit visual grounding in textual queries. Existing retrieval-augmented generation (RAG) approaches rely solely on global image features, neglecting local spatial details essential for fine-grained reasoning. To address this, we propose HuLiRAG, a novel hierarchical RAG framework featuring a “what–where–reweight” cascade: open-vocabulary detection identifies target objects (“what”), SAM-generated masks enable pixel-accurate spatial alignment (“where”), and a local-global reweighting mechanism enhances factual consistency (“reweight”). This architecture elevates visual localization from a passive bias to an active constraint. Experiments demonstrate that HuLiRAG significantly mitigates hallucination, achieving consistent accuracy gains and improved reasoning reliability across multiple fine-grained VQA benchmarks.
📝 Abstract
Multimodal large language models (MLLMs) often fail in fine-grained visual question answering, producing hallucinations about object identities, positions, and relations because textual queries are not explicitly anchored to visual referents. Retrieval-augmented generation (RAG) alleviates some of these errors, but it diverges from human-like processing at both the retrieval and augmentation stages: it relies solely on global image features, lacking the local detail needed to reason about fine-grained interactions. To overcome this limitation, we present Human-Like Retrieval-Augmented Generation (HuLiRAG), a framework that stages multimodal reasoning as a “what–where–reweight” cascade. Queries are first anchored to candidate referents via open-vocabulary detection (what), then spatially resolved with SAM-derived masks to recover fine-grained precision (where), and adaptively prioritized through a trade-off between local and global alignment (reweight). Mask-guided fine-tuning further injects spatial evidence into the generation process, transforming grounding from a passive bias into an explicit constraint on answer formulation. Extensive experiments demonstrate that this human-like cascade improves grounding fidelity and factual consistency while reducing hallucinations, advancing multimodal question answering toward trustworthy reasoning.
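To make the "reweight" stage of the cascade concrete, here is a purely illustrative sketch of how candidate referents from the detection and masking stages might be reprioritized by a local–global trade-off. All names (`Candidate`, `reweight`, the `alpha` mixing weight) are assumptions for exposition, not the authors' implementation, and real alignment scores would come from a vision–language model rather than hard-coded values:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    label: str          # object name from open-vocabulary detection ("what")
    local_sim: float    # query-to-region alignment over the SAM mask ("where")
    global_sim: float   # query-to-whole-image alignment

def reweight(c: Candidate, alpha: float = 0.6) -> float:
    # Hypothetical trade-off: a convex combination of region-level
    # and image-level alignment scores (the "reweight" stage).
    return alpha * c.local_sim + (1 - alpha) * c.global_sim

def rank_referents(candidates: list[Candidate]) -> list[Candidate]:
    # Top-ranked regions would then act as explicit spatial
    # constraints on answer generation, rather than a passive bias.
    return sorted(candidates, key=reweight, reverse=True)

candidates = [
    Candidate("dog", local_sim=0.9, global_sim=0.4),
    Candidate("sofa", local_sim=0.3, global_sim=0.8),
]
ranked = rank_referents(candidates)
print([c.label for c in ranked])  # → ['dog', 'sofa']
```

With `alpha > 0.5`, the "dog" candidate's strong local (mask-level) alignment outweighs the "sofa" candidate's higher global similarity, illustrating how local evidence can dominate the final ranking.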