RSVLM-QA: A Benchmark Dataset for Remote Sensing Vision Language Model-based Question Answering

📅 2025-08-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing remote sensing visual question answering (RS VQA) datasets suffer from coarse-grained annotations, limited question diversity, and insufficient coverage of complex reasoning capabilities. To address these limitations, we introduce RSVLM-QA, a large-scale, high-fidelity RS VQA benchmark comprising 13,820 remote sensing images and 162,373 high-quality question-answer pairs. We propose a dual-track automated annotation framework: (1) a GPT-4.1-driven multi-granularity prompting pipeline that generates semantic descriptions, spatial relations, and natural-language QA pairs; and (2) a segmentation-guided track that leverages pixel-level masks from datasets such as WHU and LoveDA to enable precise object localization and automatic generation of counting-based questions. RSVLM-QA covers six reasoning types (identification, counting, localization, comparison, causality, and cross-modal reasoning), exhibiting significantly richer annotation diversity than existing benchmarks. Comprehensive evaluation across six leading vision-language models confirms its strong challenge level and effectiveness for model assessment.

📝 Abstract
Visual Question Answering (VQA) in remote sensing (RS) is pivotal for interpreting Earth observation data. However, existing RS VQA datasets are constrained by limitations in annotation richness, question diversity, and the assessment of specific reasoning capabilities. This paper introduces the RSVLM-QA dataset, a new large-scale, content-rich VQA dataset for the RS domain. RSVLM-QA is constructed by integrating data from several prominent RS segmentation and detection datasets: WHU, LoveDA, INRIA, and iSAID. We employ an innovative dual-track annotation generation pipeline. First, we leverage Large Language Models (LLMs), specifically GPT-4.1, with meticulously designed prompts to automatically generate a suite of detailed annotations, including image captions, spatial relations, and semantic tags, alongside complex caption-based VQA pairs. Second, to address the challenging task of object counting in RS imagery, we have developed a specialized automated process that extracts object counts directly from the original segmentation data; GPT-4.1 then formulates natural-language answers from these counts, which are paired with preset question templates to create counting QA pairs. RSVLM-QA comprises 13,820 images and 162,373 VQA pairs, featuring extensive annotations and diverse question types. We provide a detailed statistical analysis of the dataset and a comparison with existing RS VQA benchmarks, highlighting the superior depth and breadth of RSVLM-QA's annotations. Furthermore, we conduct benchmark experiments on six mainstream Vision Language Models (VLMs), demonstrating that RSVLM-QA effectively evaluates and challenges the understanding and reasoning abilities of current VLMs in the RS domain. We believe RSVLM-QA will serve as a pivotal resource for the RS VQA and VLM research communities, poised to catalyze advancements in the field.
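The LLM-driven track of the pipeline described above can be sketched as a prompt-assembly step that packages per-image annotations into a request for QA pairs. The function name, prompt wording, and JSON output format below are illustrative assumptions, not the paper's actual prompts:

```python
def build_caption_qa_prompt(caption, spatial_relations):
    """Assemble a prompt asking an LLM (GPT-4.1 in the paper) to turn
    structured image annotations into VQA pairs.

    Illustrative sketch only: the real prompts are 'meticulously
    designed' per the abstract and are not reproduced here.
    """
    relations = "\n".join(f"* {r}" for r in spatial_relations)
    return (
        "You are annotating a remote sensing image.\n"
        f"Caption: {caption}\n"
        f"Spatial relations:\n{relations}\n"
        "Generate diverse question-answer pairs covering identification, "
        "localization, comparison, causality, and cross-modal reasoning. "
        'Return JSON: [{"question": "...", "answer": "..."}]'
    )

prompt = build_caption_qa_prompt(
    "An industrial area with several large warehouses beside a river.",
    ["the warehouses are north of the river",
     "a road runs along the western edge"],
)
```

The assembled string would then be sent to the model via whatever chat-completion interface is in use; that call is omitted here to avoid assuming a specific API.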
Problem

Research questions and friction points this paper is trying to address.

Addresses limitations in remote sensing VQA datasets' annotation richness and diversity
Introduces a large-scale dataset to evaluate VLM reasoning in RS imagery
Automates annotation generation for complex QA pairs in Earth observation data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses GPT-4.1 for automated annotation generation
Integrates multiple RS datasets for diverse content
Specialized automated process for object counting
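The counting track above extracts object counts directly from segmentation data before GPT-4.1 phrases the final answers. A minimal sketch of that extraction step, assuming a binary per-class mask and treating each 4-connected component as one instance (the fixed answer template here stands in for the LLM-generated phrasing):

```python
from collections import deque

def count_objects(mask):
    """Count 4-connected components of 1s in a binary grid mask,
    a stand-in for counting instances in a segmentation mask."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] == 1 and not seen[r][c]:
                count += 1                      # new component found
                queue = deque([(r, c)])
                seen[r][c] = True
                while queue:                    # flood-fill the component
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count

def make_counting_qa(mask, class_name):
    """Pair the extracted count with a preset question template
    (the paper instead has GPT-4.1 phrase the natural-language answer)."""
    n = count_objects(mask)
    return {
        "question": f"How many {class_name}s are in the image?",
        "answer": f"There are {n} {class_name}s in the image.",
        "count": n,
    }

# Toy 6x6 binary mask with two separate "building" blobs
mask = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
qa = make_counting_qa(mask, "building")  # count: 2
```

A production pipeline over WHU or LoveDA masks would use instance labels or a library routine rather than this hand-rolled flood fill, but the count-then-template flow is the same.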
Xing Zi
Researcher, University of Technology Sydney
Computer Vision · Remote Sensing · Multimodal

Jinghao Xiao
School of Computer Science, University of Technology Sydney

Yunxiao Shi
SEDE, University of Technology Sydney

Xian Tao
Institute of Automation, Chinese Academy of Sciences

Jun Li
School of Computer Science, University of Technology Sydney

Ali Braytee
University of Technology Sydney
Machine Learning · Optimization · Data Mining · Computational Biology

Mukesh Prasad
School of Computer Science, University of Technology Sydney