🤖 AI Summary
This study addresses the challenge of understanding temporal dynamics and reasoning semantically about geospatial targets under sparse satellite observations, proposing a paradigm that goes beyond conventional change detection. The authors introduce the SMART-HC-VQA dataset, which transforms Sentinel-2 remote sensing imagery, spatiotemporal annotations, and geospatial metadata into visual question answering samples, including approximately 2.3 million two-image temporal comparison examples. These comparison examples are synthesized with an Image-Pairwise Combinatorial Augmentation strategy. Building on the LLaVA-NeXT Mistral-7B architecture, the work adapts a multimodal large language model to jointly process multiple timestamped images, with the aim of modeling state evolution and inferring trends in heavy construction activity, and offers a reproducible foundation for language-guided temporal remote sensing understanding.
📝 Abstract
We introduce SMART-HC-VQA, a Sentinel-2-based visual question answering dataset derived from the IARPA SMART Heavy Construction dataset, designed for spatiotemporal analysis of human activity. The dataset transforms construction-site annotations, construction-type labels, temporal-phase labels, geographic metadata, and observation relationships into natural language question-answer triplets. This approach recasts the existing dataset as a temporally extended automatic target recognition and visual question answering (VQA) challenge, treating a fixed geospatial site as a target whose attributes and activity states evolve across sparse satellite observations. Currently, SMART-HC-VQA comprises 21,837 accessible Sentinel-2 image chips, 65,511 single-image VQA examples, and approximately 2.3 million two-image temporal comparison examples generated via our novel Image-Pairwise Combinatorial Augmentation. We detail the workflow for retrieving and processing Sentinel-2 imagery, segmenting large satellite tiles into site-centered images, maintaining traceability to SMART-HC annotations, and analyzing the distributions of site size, observation count, temporal coverage, construction type, and phase labels. Additionally, we describe an implemented multi-image MLLM training framework based on LLaVA-NeXT Mistral-7B, adapted to accept multiple dated image inputs and train on metadata-derived VQA examples. This work offers a reproducible foundation for language-guided understanding of activity in remote sensing imagery, aiming not only to detect change but also to reason about ongoing processes, their progression, and potential future developments.
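The abstract does not spell out how Image-Pairwise Combinatorial Augmentation builds its two-image examples. One plausible reading is that every pair of observations of the same site becomes a comparison example, which would explain how a few tens of thousands of chips yield millions of pairs. The Python sketch below illustrates that reading only. The `Chip` record, the `temporal_pairs` and `phase_change_example` helpers, the question wording, and the sample site IDs and dates are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from datetime import date
from itertools import combinations
from typing import Dict, Iterator, List, Tuple


@dataclass(frozen=True)
class Chip:
    """One site-centered image chip (hypothetical record layout)."""
    site_id: str
    acquired: date
    phase: str  # temporal-phase label; the values used below are illustrative


def temporal_pairs(chips: List[Chip]) -> Iterator[Tuple[Chip, Chip]]:
    """Yield every (earlier, later) pair of chips that share a site.

    Assumption: pairs are unordered combinations within a site, emitted in
    chronological order. The paper may pair images differently.
    """
    by_site: Dict[str, List[Chip]] = {}
    for chip in chips:
        by_site.setdefault(chip.site_id, []).append(chip)
    for site_chips in by_site.values():
        ordered = sorted(site_chips, key=lambda c: c.acquired)
        # A site with n observations contributes n * (n - 1) / 2 pairs.
        yield from combinations(ordered, 2)


def phase_change_example(earlier: Chip, later: Chip) -> dict:
    """Build one metadata-derived comparison example (hypothetical template)."""
    question = (
        f"Did the construction phase change between "
        f"{earlier.acquired.isoformat()} and {later.acquired.isoformat()}?"
    )
    answer = "yes" if earlier.phase != later.phase else "no"
    return {
        "site_id": earlier.site_id,
        "image_dates": [earlier.acquired, later.acquired],
        "question": question,
        "answer": answer,
    }


# Made-up observations: three of site "A", two of site "B".
chips = [
    Chip("A", date(2019, 1, 5), "No Activity"),
    Chip("A", date(2019, 6, 9), "Site Preparation"),
    Chip("A", date(2020, 2, 1), "Active Construction"),
    Chip("B", date(2018, 3, 3), "No Activity"),
    Chip("B", date(2018, 9, 7), "No Activity"),
]
pairs = list(temporal_pairs(chips))
examples = [phase_change_example(a, b) for a, b in pairs]
print(len(examples))  # 3 pairs from site A + 1 pair from site B = 4
```

Because the pair count grows quadratically with the number of observations per site, sites with long observation histories would dominate a pair set built this way. That is one reason the distribution of observation counts per site is worth examining alongside the dataset totals.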