🤖 AI Summary
In remote sensing, the scarcity of paired vision-language data leads to coarse image captions and frequent hallucinations in multimodal models. Method: This paper introduces a synthetic annotation framework that integrates geographic maps as external knowledge, the first to incorporate maps into remote sensing vision-language generation to improve spatial-semantic consistency. It constructs fMoW-mm, the first multimodal benchmark aligning imagery, maps, metadata, and text as quadruples, and proposes an interpretable hallucination-detection and fact-checking-driven rewriting framework. The approach includes map-image geometric registration, cross-modal alignment modeling, LLM prompt engineering, and few-shot automatic target recognition (ATR) fine-tuning. Contribution/Results: Experiments demonstrate significant improvements over baselines in few-shot object recognition: the hallucination rate is reduced by 37.2% and spatial-description accuracy increases by 51.8%. The fMoW-mm dataset is publicly released to support downstream multi-task generalization.
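The quadruple alignment described above pairs each satellite image with a co-registered map, structured metadata, and a fact-checked caption. A minimal sketch of what one such record might look like is given below; all field names and values are illustrative assumptions, not the released fMoW-mm schema.

```python
from dataclasses import dataclass


@dataclass
class QuadrupleSample:
    """One quadruple-aligned record: imagery, map, metadata, text.

    Hypothetical structure for illustration only; the actual fMoW-mm
    format may differ.
    """
    image_path: str   # satellite image tile
    map_path: str     # co-registered map rendering of the same area
    metadata: dict    # e.g. location, timestamp, sensor properties
    caption: str      # LLM-generated, fact-checked description


# Example record (values are made up for illustration)
sample = QuadrupleSample(
    image_path="images/000001.png",
    map_path="maps/000001.png",
    metadata={"lat": 37.77, "lon": -122.42, "timestamp": "2016-05-01"},
    caption="An airport with two parallel runways beside a highway.",
)
```

Grouping the four modalities into a single record is what enables map-grounded caption generation and, later, fact-checking a caption against the map and metadata rather than the image alone.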
📝 Abstract
Vision-language models have achieved impressive results across various fields. However, their adoption in remote sensing remains limited, largely due to the scarcity of paired image-text data. To bridge this gap, synthetic caption generation has gained interest, traditionally relying on rule-based methods that use metadata or bounding boxes. While these approaches provide some description, they often lack the depth needed to capture complex wide-area scenes. Large language models (LLMs) offer a promising alternative for generating more descriptive captions, yet they can produce generic outputs and are prone to hallucination. In this paper, we propose a new method to enhance vision-language datasets for remote sensing by integrating maps as an external data source, enabling the generation of detailed, context-rich captions. Additionally, we present methods to measure and mitigate hallucinations in LLM-generated text. We introduce fMoW-mm, a multimodal dataset incorporating satellite imagery, maps, metadata, and text annotations. We demonstrate its effectiveness for automatic target recognition in few-shot settings, achieving superior performance compared to other vision-language remote sensing datasets.