🤖 AI Summary
Existing remote sensing visual question answering (RS VQA) datasets suffer from coarse-grained annotations, limited question diversity, and insufficient coverage of complex reasoning capabilities. To address these limitations, we introduce RSVLM-QA, a new large-scale, high-fidelity RS VQA benchmark comprising 13,820 remote sensing images and 162,373 high-quality question-answer pairs. We propose a dual-track automated annotation framework: (1) a GPT-4.1-driven multi-granularity prompting pipeline that generates semantic descriptions, spatial relations, and natural-language QA pairs; and (2) a segmentation-guided track that leverages pixel-level masks from WHU and LoveDA to enable precise object localization and automatic generation of counting-based questions. RSVLM-QA covers six reasoning types—identification, counting, localization, comparison, causality, and cross-modal reasoning—and exhibits significantly richer annotation diversity than existing benchmarks. Comprehensive evaluation across six leading vision-language models confirms its strong challenge level and effectiveness for model assessment.
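A minimal sketch of the segmentation-guided counting track described above: object counts are derived from a pixel-level mask via connected-component labeling and paired with a preset question template. The class IDs, template wording, and the connected-component heuristic are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np
from scipy import ndimage


def count_objects(mask: np.ndarray, class_id: int) -> int:
    """Count connected regions of `class_id` in a segmentation mask."""
    binary = (mask == class_id)
    _, num_components = ndimage.label(binary)
    return num_components


def make_counting_qa(mask: np.ndarray, class_id: int, class_name: str) -> dict:
    """Pair the extracted count with a preset question template."""
    count = count_objects(mask, class_id)
    question = f"How many {class_name}s are visible in this image?"
    return {"question": question, "count": count}


# Example: count buildings in a toy mask (class id 1 is an assumption).
toy_mask = np.zeros((8, 8), dtype=np.int64)
toy_mask[1:3, 1:3] = 1
toy_mask[5:7, 5:7] = 1
print(make_counting_qa(toy_mask, class_id=1, class_name="building"))
# -> {'question': 'How many buildings are visible in this image?', 'count': 2}
```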
📝 Abstract
Visual Question Answering (VQA) in remote sensing (RS) is pivotal for interpreting Earth observation data. However, existing RS VQA datasets are constrained by limitations in annotation richness, question diversity, and the assessment of specific reasoning capabilities. This paper introduces RSVLM-QA, a new large-scale, content-rich VQA dataset for the RS domain. RSVLM-QA is constructed by integrating data from several prominent RS segmentation and detection datasets: WHU, LoveDA, INRIA, and iSAID. We employ an innovative dual-track annotation generation pipeline. Firstly, we leverage Large Language Models (LLMs), specifically GPT-4.1, with meticulously designed prompts to automatically generate a suite of detailed annotations including image captions, spatial relations, and semantic tags, alongside complex caption-based VQA pairs. Secondly, to address the challenging task of object counting in RS imagery, we have developed a specialized automated process that extracts object counts directly from the original segmentation data; GPT-4.1 then formulates natural language answers from these counts, which are paired with preset question templates to create counting QA pairs. RSVLM-QA comprises 13,820 images and 162,373 VQA pairs, featuring extensive annotations and diverse question types. We provide a detailed statistical analysis of the dataset and a comparison with existing RS VQA benchmarks, highlighting the superior depth and breadth of RSVLM-QA's annotations. Furthermore, we conduct benchmark experiments on six mainstream Vision Language Models (VLMs), demonstrating that RSVLM-QA effectively evaluates and challenges the understanding and reasoning abilities of current VLMs in the RS domain. We believe RSVLM-QA will serve as a pivotal resource for the RS VQA and VLM research communities, poised to catalyze advancements in the field.
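A hedged sketch of the final step of the second track: GPT-4.1 rephrases a raw object count as a natural-language answer, which is then paired with the template question. The prompt wording is an assumption and the paper's exact prompts are not reproduced here; the `gpt-4.1` model identifier and the OpenAI chat-completions call are standard SDK usage.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def formulate_answer(class_name: str, count: int) -> str:
    """Ask GPT-4.1 to phrase an object count as a natural-language answer."""
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system",
             "content": "Rewrite object counts from remote sensing images as "
                        "concise natural-language answers."},
            {"role": "user",
             "content": f"The image contains {count} {class_name}(s). "
                        "State this as a one-sentence answer."},
        ],
    )
    return response.choices[0].message.content


# Pair the generated answer with the preset counting question.
qa_pair = {
    "question": "How many buildings are visible in this image?",
    "answer": formulate_answer("building", 2),
}
```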