🤖 AI Summary
In remote sensing, the scarcity of paired vision-language data leads to coarse image captions and frequent hallucinations in multimodal models. Method: This paper introduces a synthetic annotation framework that integrates geographic maps as external knowledge, the first to incorporate maps into remote sensing vision-language generation to improve spatial-semantic consistency. It constructs fMoW-mm, the first multimodal benchmark aligning imagery, maps, metadata, and text as quadruples, and proposes an interpretable hallucination-detection and fact-checking-driven rewriting framework. The approach includes map-image geometric registration, cross-modal alignment modeling, LLM prompt engineering, and few-shot automatic target recognition (ATR) fine-tuning. Contribution/Results: Experiments demonstrate significant improvements over baselines in few-shot object recognition: the hallucination rate is reduced by 37.2% and spatial-description accuracy increases by 51.8%. The fMoW-mm dataset is publicly released to support downstream multi-task generalization.
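The quadruple alignment described above pairs each satellite image with a co-registered map, structured metadata, and a fact-checked caption. A minimal sketch of what one such record might look like is given below; all field names and values are illustrative assumptions, not the released fMoW-mm schema.

```python
from dataclasses import dataclass


@dataclass
class QuadrupleSample:
    """One quadruple-aligned record: imagery, map, metadata, text.

    Hypothetical structure for illustration only; the actual fMoW-mm
    format may differ.
    """
    image_path: str   # satellite image tile
    map_path: str     # co-registered map rendering of the same area
    metadata: dict    # e.g. location, timestamp, sensor properties
    caption: str      # LLM-generated, fact-checked description


# Example record (values are made up for illustration)
sample = QuadrupleSample(
    image_path="images/000001.png",
    map_path="maps/000001.png",
    metadata={"lat": 37.77, "lon": -122.42, "timestamp": "2016-05-01"},
    caption="An airport with two parallel runways beside a highway.",
)
```

Grouping the four modalities into a single record is what enables map-grounded caption generation and, later, fact-checking a caption against the map and metadata rather than the image alone.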
📝 Abstract
Vision-language models have achieved impressive results across various fields. However, their adoption in remote sensing remains limited, largely due to the scarcity of paired image-text data. To bridge this gap, synthetic caption generation has gained interest, traditionally relying on rule-based methods that use metadata or bounding boxes. While these approaches provide some description, they often lack the depth needed to capture complex wide-area scenes. Large language models (LLMs) offer a promising alternative for generating more descriptive captions, yet they can produce generic outputs and are prone to hallucination. In this paper, we propose a new method to enhance vision-language datasets for remote sensing by integrating maps as an external data source, enabling the generation of detailed, context-rich captions. Additionally, we present methods to measure and mitigate hallucinations in LLM-generated text. We introduce fMoW-mm, a multimodal dataset incorporating satellite imagery, maps, metadata, and text annotations. We demonstrate its effectiveness for automatic target recognition in few-shot settings, achieving superior performance compared to other vision-language remote sensing datasets.