🤖 AI Summary
This work addresses the severe hallucination issues in remote sensing visual question answering, where multimodal large language models often fail due to inaccurate visual grounding or misinterpretation of fine-grained small objects. To mitigate this, we propose RADAR, a training-free inference framework that leverages the model's intrinsic attention mechanisms to enable progressive localization and fine-grained local reasoning at inference time through a relative attention-driven active reasoning strategy. We also introduce RSHBench, the first benchmark designed specifically for fine-grained diagnosis of hallucinations in remote sensing. Experimental results demonstrate that RADAR significantly improves question-answering performance across multiple multimodal large language models and effectively suppresses both factual and logical hallucinations.
📄 Abstract
Multimodal large language models (MLLMs) suffer from pronounced hallucinations in remote sensing visual question-answering (RS-VQA), primarily caused by visual grounding failures in large-scale scenes or misinterpretation of fine-grained small targets. To systematically analyze these issues, we introduce RSHBench, a protocol-based benchmark for fine-grained diagnosis of factual and logical hallucinations. To mitigate grounding-induced factual hallucinations, we further propose Relative Attention-Driven Actively Reasoning (RADAR), a training-free inference method that leverages intrinsic attention in MLLMs to guide progressive localization and fine-grained local reasoning at test time. Extensive experiments across diverse MLLMs demonstrate that RADAR consistently improves RS-VQA performance and reduces both factual and logical hallucinations. Code and data will be publicly available at: https://github.com/MiliLab/RADAR
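To make the "relative attention-driven" idea concrete, here is a minimal sketch of how attention could guide progressive localization. This is an illustrative assumption, not RADAR's actual implementation: the function names (`relative_attention_map`, `select_focus_region`), the baseline-prompt normalization, and the grid partitioning are all hypothetical; the real method is defined in the paper and repository.

```python
import numpy as np

def relative_attention_map(attn, baseline_attn, eps=1e-8):
    """Hypothetical relative attention: how much more the question attends
    to each image patch than a content-free baseline prompt does."""
    rel = attn / (baseline_attn + eps)
    return rel / rel.sum()  # normalize to a distribution over patches

def select_focus_region(rel_map, grid=(4, 4), top_k=1):
    """Score each cell of a coarse grid by its total relative attention
    and return the top-k cells as candidate regions to zoom into."""
    h, w = rel_map.shape
    gh, gw = h // grid[0], w // grid[1]
    scores = {}
    for i in range(grid[0]):
        for j in range(grid[1]):
            scores[(i, j)] = rel_map[i * gh:(i + 1) * gh,
                                     j * gw:(j + 1) * gw].sum()
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy example: an 8x8 patch grid where the question's attention
# concentrates on the top-left corner (e.g., a small target).
attn = np.ones((8, 8))
attn[0:2, 0:2] = 10.0
rel = relative_attention_map(attn, np.ones((8, 8)))
print(select_focus_region(rel, grid=(4, 4), top_k=1))  # → [(0, 0)]
```

In a test-time loop, the selected region would be cropped and re-encoded at higher resolution for a second round of local reasoning; the answer is then produced from the combination of global and local views.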