Taming a Retrieval Framework to Read Images in Humanlike Manner for Augmenting Generation of MLLMs

📅 2025-10-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Multimodal large language models (MLLMs) suffer from hallucinations—particularly concerning object identity, spatial location, and relational semantics—in fine-grained visual question answering (VQA), primarily due to the absence of explicit visual grounding in textual queries. Existing retrieval-augmented generation (RAG) approaches rely solely on global image features, neglecting local spatial details essential for fine-grained reasoning. To address this, we propose HuLiRAG, a novel hierarchical RAG framework featuring a “what–where–reweight” cascade: open-vocabulary detection identifies target objects (“what”), SAM-generated masks enable pixel-accurate spatial alignment (“where”), and a local-global reweighting mechanism enhances factual consistency (“reweight”). This architecture elevates visual localization from a passive bias to an active constraint. Experiments demonstrate that HuLiRAG significantly mitigates hallucination, achieving consistent accuracy gains and improved reasoning reliability across multiple fine-grained VQA benchmarks.
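
To make the cascade concrete, below is a minimal sketch of how a what–where–reweight retrieval step could be wired together. Everything in it is an illustrative assumption rather than the paper's implementation: the detector, SAM wrapper, and encoder are dummy stand-ins, and the fixed mixing weight `lam` is a simplification of the paper's local-global reweighting mechanism.

```python
# Minimal sketch of a "what–where–reweight" retrieval cascade (illustrative only).
# The detector, segmenter, and encoder below are dummy placeholders; a real
# system would plug in an open-vocabulary detector, SAM, and a CLIP-style encoder.

from dataclasses import dataclass
from typing import List, Tuple
import numpy as np


@dataclass
class Detection:
    label: str                       # candidate referent ("what")
    box: Tuple[int, int, int, int]   # (x1, y1, x2, y2)


@dataclass
class Evidence:
    label: str
    mask: np.ndarray                 # pixel mask ("where")
    score: float                     # relevance after reweighting


# --- placeholder model wrappers (assumptions, not the paper's API) -----------
def detect_open_vocab(image: np.ndarray, query: str) -> List[Detection]:
    h, w = image.shape[:2]
    return [Detection("candidate-object", (0, 0, w // 2, h // 2))]


def segment_with_sam(image: np.ndarray, box: Tuple[int, int, int, int]) -> np.ndarray:
    x1, y1, x2, y2 = box
    mask = np.zeros(image.shape[:2], dtype=bool)
    mask[y1:y2, x1:x2] = True
    return mask


def embed(x) -> np.ndarray:
    # Stand-in for a vision-language encoder; returns a deterministic unit vector.
    rng = np.random.default_rng(len(str(x)) % 997)
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)
# ------------------------------------------------------------------------------


def retrieve(image: np.ndarray, query: str, lam: float = 0.5, top_k: int = 3) -> List[Evidence]:
    """Rank region-level evidence by mixing local (mask) and global (scene) alignment."""
    global_sim = float(embed(image) @ embed(query))          # global image-query match
    results = []
    for det in detect_open_vocab(image, query):              # "what"
        mask = segment_with_sam(image, det.box)              # "where"
        region = image * mask[..., None]                     # keep only masked pixels
        local_sim = float(embed(region) @ embed(query))
        score = lam * local_sim + (1.0 - lam) * global_sim   # "reweight"
        results.append(Evidence(det.label, mask, score))
    return sorted(results, key=lambda e: e.score, reverse=True)[:top_k]


if __name__ == "__main__":
    img = np.zeros((480, 640, 3), dtype=np.float32)
    print(retrieve(img, "the red cup on the left")[0].label)
```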

📝 Abstract
Multimodal large language models (MLLMs) often fail in fine-grained visual question answering, producing hallucinations about object identities, positions, and relations because textual queries are not explicitly anchored to visual referents. Retrieval-augmented generation (RAG) alleviates some errors, but it fails to align with human-like processing at both the retrieval and augmentation levels. Specifically, it focuses only on global-level image information but lacks local detail and limits reasoning about fine-grained interactions. To overcome this limitation, we present Human-Like Retrieval-Augmented Generation (HuLiRAG), a framework that stages multimodal reasoning as a “what–where–reweight” cascade. Queries are first anchored to candidate referents via open-vocabulary detection (what), then spatially resolved with SAM-derived masks to recover fine-grained precision (where), and adaptively prioritized through the trade-off between local and global alignment (reweight). Mask-guided fine-tuning further injects spatial evidence into the generation process, transforming grounding from a passive bias into an explicit constraint on answer formulation. Extensive experiments demonstrate that this human-like cascade improves grounding fidelity and factual consistency while reducing hallucinations, advancing multimodal question answering toward trustworthy reasoning.
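
One simple way to write the local–global trade-off mentioned in the abstract, kept generic because the exact formulation is not spelled out here, is as a (possibly query-adaptive) convex combination of mask-level and image-level alignment scores:

```latex
% Illustrative reweighting (an assumption, not the paper's exact formula):
% q is the query, I the full image, m_i the SAM mask of candidate i,
% and \lambda \in [0, 1] may be fixed or predicted from (q, I).
s_i = \lambda \,\operatorname{sim}_{\text{loc}}(m_i, q) + (1 - \lambda)\,\operatorname{sim}_{\text{glob}}(I, q)
```
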
Problem

Research questions and friction points this paper is trying to address.

Improving fine-grained visual question answering in MLLMs
Reducing hallucinations about object identities and relations
Aligning retrieval with human-like multimodal reasoning processes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-vocabulary detection anchors queries to referents
SAM-derived masks spatially resolve fine-grained details
Mask-guided fine-tuning injects spatial evidence explicitly
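
As a concrete illustration of the last point, the sketch below shows one plausible way retrieved mask evidence could be serialized into the model's input during fine-tuning or generation. The prompt format and field names are assumptions for illustration (reusing the `Evidence` records from the retrieval sketch above), not HuLiRAG's actual recipe.

```python
# Illustrative mask-guided prompt construction (assumed format, not the paper's).
# `evidences` are Evidence records like those produced by the retrieval sketch above.

def build_mask_guided_prompt(question: str, evidences) -> str:
    """Prepend grounded region evidence so the answer is constrained by explicit
    spatial references rather than by the global image alone."""
    lines = []
    for i, ev in enumerate(evidences):
        ys, xs = ev.mask.nonzero()                       # pixels covered by the SAM mask
        box = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
        lines.append(f"[region {i}] label={ev.label} box={box} score={ev.score:.2f}")
    return (
        "Answer using only objects supported by the grounded regions below.\n"
        + "\n".join(lines)
        + f"\nQuestion: {question}"
    )
```
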
🔎 Similar Papers
No similar papers found.

👥 Authors
Suyang Xi (Emory University, Atlanta, USA)
Chenxi Yang (University of Electronic Science and Technology of China, Chengdu, China)
Hong Ding (Tsung-Dao Lee Institute, Shanghai Jiao Tong University)
Yiqing Ni (The Hong Kong Polytechnic University, Hong Kong, China)
Catherine C. Liu (The Hong Kong Polytechnic University, Hong Kong, China)
Yunhao Liu (Tsinghua University)
Chengqi Zhang