🤖 AI Summary
Multimodal large language models (MLLMs) suffer from hallucinations—particularly concerning object identity, spatial location, and relational semantics—in fine-grained visual question answering (VQA), primarily due to the absence of explicit visual grounding in textual queries. Existing retrieval-augmented generation (RAG) approaches rely solely on global image features, neglecting local spatial details essential for fine-grained reasoning. To address this, we propose HuLiRAG, a novel hierarchical RAG framework featuring a “what–where–reweight” cascade: open-vocabulary detection identifies target objects (“what”), SAM-generated masks enable pixel-accurate spatial alignment (“where”), and a local-global reweighting mechanism enhances factual consistency (“reweight”). This architecture elevates visual localization from a passive bias to an active constraint. Experiments demonstrate that HuLiRAG significantly mitigates hallucination, achieving consistent accuracy gains and improved reasoning reliability across multiple fine-grained VQA benchmarks.
📝 Abstract
Multimodal large language models (MLLMs) often fail in fine-grained visual question answering, producing hallucinations about object identities, positions, and relations because textual queries are not explicitly anchored to visual referents. Retrieval-augmented generation (RAG) alleviates some of these errors, but it diverges from human-like processing at both the retrieval and augmentation stages: it relies solely on global image features, lacking the local detail needed to reason about fine-grained interactions. To overcome this limitation, we present Human-Like Retrieval-Augmented Generation (HuLiRAG), a framework that stages multimodal reasoning as a “what–where–reweight” cascade. Queries are first anchored to candidate referents via open-vocabulary detection (what), then spatially resolved with SAM-derived masks to recover fine-grained precision (where), and adaptively prioritized through a trade-off between local and global alignment (reweight). Mask-guided fine-tuning further injects spatial evidence into the generation process, transforming grounding from a passive bias into an explicit constraint on answer formulation. Extensive experiments demonstrate that this human-like cascade improves grounding fidelity and factual consistency while reducing hallucinations, advancing multimodal question answering toward trustworthy reasoning.
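To make the "reweight" stage of the cascade concrete, here is a purely illustrative sketch of how candidate referents from the detection and masking stages might be reprioritized by a local–global trade-off. All names (`Candidate`, `reweight`, the `alpha` mixing weight) are assumptions for exposition, not the authors' implementation, and real alignment scores would come from a vision–language model rather than hard-coded values:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    label: str          # object name from open-vocabulary detection ("what")
    local_sim: float    # query-to-region alignment over the SAM mask ("where")
    global_sim: float   # query-to-whole-image alignment

def reweight(c: Candidate, alpha: float = 0.6) -> float:
    # Hypothetical trade-off: a convex combination of region-level
    # and image-level alignment scores (the "reweight" stage).
    return alpha * c.local_sim + (1 - alpha) * c.global_sim

def rank_referents(candidates: list[Candidate]) -> list[Candidate]:
    # Top-ranked regions would then act as explicit spatial
    # constraints on answer generation, rather than a passive bias.
    return sorted(candidates, key=reweight, reverse=True)

candidates = [
    Candidate("dog", local_sim=0.9, global_sim=0.4),
    Candidate("sofa", local_sim=0.3, global_sim=0.8),
]
ranked = rank_referents(candidates)
print([c.label for c in ranked])  # → ['dog', 'sofa']
```

With `alpha > 0.5`, the "dog" candidate's strong local (mask-level) alignment outweighs the "sofa" candidate's higher global similarity, illustrating how local evidence can dominate the final ranking.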