🤖 AI Summary
This work addresses the severe hallucination issues in remote sensing visual question answering, where multimodal large language models often fail due to inaccurate visual grounding or misinterpretation of fine-grained small objects. To mitigate this, we propose RADAR, a training-free inference framework that leverages the model's intrinsic attention mechanisms to enable progressive localization and fine-grained local reasoning at inference time through a relative attention-driven active reasoning strategy. We also introduce RSHBench, the first benchmark designed specifically for fine-grained diagnosis of hallucinations in remote sensing. Experimental results demonstrate that RADAR significantly improves question-answering performance across multiple multimodal large language models and effectively suppresses both factual and logical hallucinations.
📄 Abstract
Multimodal large language models (MLLMs) suffer from pronounced hallucinations in remote sensing visual question-answering (RS-VQA), primarily caused by visual grounding failures in large-scale scenes or misinterpretation of fine-grained small targets. To systematically analyze these issues, we introduce RSHBench, a protocol-based benchmark for fine-grained diagnosis of factual and logical hallucinations. To mitigate grounding-induced factual hallucinations, we further propose Relative Attention-Driven Actively Reasoning (RADAR), a training-free inference method that leverages intrinsic attention in MLLMs to guide progressive localization and fine-grained local reasoning at test time. Extensive experiments across diverse MLLMs demonstrate that RADAR consistently improves RS-VQA performance and reduces both factual and logical hallucinations. Code and data will be publicly available at: https://github.com/MiliLab/RADAR
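To make the "relative attention-driven" idea concrete, here is a minimal sketch of how attention could guide progressive localization. This is an illustrative assumption, not RADAR's actual implementation: the function names (`relative_attention_map`, `select_focus_region`), the baseline-prompt normalization, and the grid partitioning are all hypothetical; the real method is defined in the paper and repository.

```python
import numpy as np

def relative_attention_map(attn, baseline_attn, eps=1e-8):
    """Hypothetical relative attention: how much more the question attends
    to each image patch than a content-free baseline prompt does."""
    rel = attn / (baseline_attn + eps)
    return rel / rel.sum()  # normalize to a distribution over patches

def select_focus_region(rel_map, grid=(4, 4), top_k=1):
    """Score each cell of a coarse grid by its total relative attention
    and return the top-k cells as candidate regions to zoom into."""
    h, w = rel_map.shape
    gh, gw = h // grid[0], w // grid[1]
    scores = {}
    for i in range(grid[0]):
        for j in range(grid[1]):
            scores[(i, j)] = rel_map[i * gh:(i + 1) * gh,
                                     j * gw:(j + 1) * gw].sum()
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy example: an 8x8 patch grid where the question's attention
# concentrates on the top-left corner (e.g., a small target).
attn = np.ones((8, 8))
attn[0:2, 0:2] = 10.0
rel = relative_attention_map(attn, np.ones((8, 8)))
print(select_focus_region(rel, grid=(4, 4), top_k=1))  # → [(0, 0)]
```

In a test-time loop, the selected region would be cropped and re-encoded at higher resolution for a second round of local reasoning; the answer is then produced from the combination of global and local views.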