🤖 AI Summary
Existing remote sensing change detection methods can localize changed regions but cannot explain, in natural language, what changed semantically within them. Current change captioning datasets predominantly offer image-level descriptions, which hinders fine-grained understanding. To bridge this gap, the authors introduce RSRCC, which they describe as the first benchmark for fine-grained, region-based question answering on semantic changes in remote sensing imagery. It comprises about 126K questions, each requiring reasoning about a specific semantic change. The benchmark is built with a hierarchical semi-supervised curation pipeline: candidate change regions are extracted from semantic segmentation masks, screened with an image-text embedding model, and validated through retrieval-augmented vision-language curation with Best-of-N ranking, which filters noisy and ambiguous candidates. With 87K training, 17.1K validation, and 22K test samples, RSRCC provides a data foundation and evaluation benchmark for semantic change understanding in remote sensing.
📝 Abstract
Traditional change detection identifies where changes occur, but does not explain what changed in natural language. Existing remote sensing change captioning datasets typically describe overall image-level differences, leaving fine-grained localized semantic reasoning largely unexplored. To close this gap, we present RSRCC, a new benchmark for remote sensing change question-answering containing 126k questions, split into 87k training, 17.1k validation, and 22k test instances. Unlike prior datasets, RSRCC is built around localized, change-specific questions that require reasoning about a particular semantic change. To the best of our knowledge, this is the first remote sensing change question-answering benchmark designed explicitly for such fine-grained reasoning-based supervision. To construct RSRCC, we introduce a hierarchical semi-supervised curation pipeline that uses Best-of-N ranking as a critical final ambiguity-resolution stage. First, candidate change regions are extracted from semantic segmentation masks, then initially screened using an image-text embedding model, and finally validated through retrieval-augmented vision-language curation with Best-of-N ranking. This process enables scalable filtering of noisy and ambiguous candidates while preserving semantically meaningful changes. The dataset is available at https://huggingface.co/datasets/google/RSRCC.
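The first and last stages of the pipeline described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the RSRCC implementation: the function names `candidate_change_regions` and `best_of_n`, the 4-connected grouping of changed pixels, the `min_pixels` and `min_score` thresholds, and the caller-supplied `score` function (which stands in for whatever model ranks candidates). The intermediate embedding-based screening stage is omitted.

```python
from collections import deque
from typing import Callable, List, Optional, Sequence, Tuple

Mask = List[List[int]]  # per-pixel semantic class ids
Region = Tuple[int, int, List[Tuple[int, int]]]  # (class_before, class_after, pixels)


def candidate_change_regions(before: Mask, after: Mask, min_pixels: int = 2) -> List[Region]:
    """Group changed pixels into 4-connected regions sharing one class transition.

    Hypothetical helper: the paper only states that candidates come from
    semantic segmentation masks; the grouping rule here is an assumption.
    """
    height, width = len(before), len(before[0])
    seen = [[False] * width for _ in range(height)]
    regions: List[Region] = []
    for r in range(height):
        for c in range(width):
            if seen[r][c] or before[r][c] == after[r][c]:
                continue  # already assigned, or the class did not change
            transition = (before[r][c], after[r][c])
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels: List[Tuple[int, int]] = []
            while queue:  # breadth-first flood fill over matching neighbours
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx]
                            and (before[ny][nx], after[ny][nx]) == transition):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(pixels) >= min_pixels:  # drop tiny, likely noisy regions
                regions.append((transition[0], transition[1], pixels))
    return regions


def best_of_n(candidates: Sequence[str],
              score: Callable[[str], float],
              min_score: float) -> Optional[str]:
    """Return the highest-scoring candidate, or None if even the best is below min_score."""
    if not candidates:
        return None
    best = max(candidates, key=score)
    return best if score(best) >= min_score else None


if __name__ == "__main__":
    before = [[0, 0, 0], [0, 1, 1], [0, 1, 1]]
    after = [[0, 0, 0], [0, 2, 2], [0, 2, 1]]
    for cls_before, cls_after, pixels in candidate_change_regions(before, after):
        print(f"class {cls_before} -> {cls_after}: {len(pixels)} pixels")
```

In this sketch, rejecting a region outright when no candidate clears `min_score` is one simple way to resolve ambiguity; the actual ranking criterion used for RSRCC may differ.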