Geospatial-Temporal Sensemaking of Remote Sensing Activity Detections with Multimodal Large Language Model

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the challenge of understanding temporal dynamics and reasoning semantically about geospatial targets under sparse satellite observations, going beyond conventional change detection. The authors introduce SMART-HC-VQA, a dataset derived from the IARPA SMART Heavy Construction dataset that converts Sentinel-2 imagery, spatiotemporal annotations, and geospatial metadata into visual question answering samples: 65,511 single-image examples over 21,837 image chips, plus approximately 2.3 million two-image temporal comparison examples generated with their Image-Pairwise Combinatorial Augmentation strategy. Building on the LLaVA-NeXT Mistral-7B architecture, they also describe a multi-image training framework adapted to accept multiple dated images, with the aim of modeling how heavy construction sites evolve and reasoning about their likely progression. The work is presented as a reproducible foundation for language-guided temporal understanding of remote sensing imagery.

📝 Abstract

We introduce SMART-HC-VQA, a Sentinel-2-based visual question answering dataset derived from the IARPA SMART Heavy Construction dataset, designed for spatiotemporal analysis of human activity. The dataset transforms construction-site annotations, construction-type labels, temporal-phase labels, geographic metadata, and observation relationships into natural language question-answer triplets. This approach redefines the existing dataset as a temporally extended automatic target recognition and visual question answering (VQA) challenge, considering a fixed geospatial site as a target whose attributes and activity states evolve across sparse satellite observations. Currently, SMART-HC-VQA comprises 21,837 accessible Sentinel-2 image chips, 65,511 single-image VQA examples, and approximately 2.3 million two-image temporal comparison examples generated via our novel Image-Pairwise Combinatorial Augmentation. We detail the workflow for retrieving and processing Sentinel-2 imagery, segmenting large satellite tiles into site-centered images, maintaining traceability to SMART-HC annotations, and analyzing the distributions of site size, observation count, temporal coverage, construction type, and phase labels. Additionally, we describe an implemented multi-image MLLM training framework based on LLaVA-NeXT Mistral-7B, adapted to accept multiple dated image inputs and train on metadata-derived VQA examples. This work offers a reproducible foundation for understanding language-guided remote sensing activities, aiming not only to detect change but also to reason about the ongoing processes, their progression, and potential future developments.
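
The abstract does not spell out how Image-Pairwise Combinatorial Augmentation works, so the following is only a minimal sketch of one plausible reading: every pair of dated observations of the same site becomes one two-image comparison example. The `Observation` class, the `pairwise_examples` function, the question wording, and the phase labels are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from datetime import date
from itertools import combinations


@dataclass(frozen=True)
class Observation:
    """One dated image chip of a site (hypothetical schema, not the paper's)."""
    site_id: str
    acquired: date
    phase: str  # placeholder phase label; the real label set may differ


def pairwise_examples(observations):
    """Build one comparison example per pair of observations of the same site."""
    # Group observations by site so that pairs never mix different sites.
    by_site = {}
    for obs in observations:
        by_site.setdefault(obs.site_id, []).append(obs)

    examples = []
    for site_id, site_obs in by_site.items():
        # Sort chronologically so each pair is ordered (earlier, later).
        site_obs.sort(key=lambda o: o.acquired)
        for earlier, later in combinations(site_obs, 2):
            examples.append({
                "site_id": site_id,
                "dates": (earlier.acquired.isoformat(), later.acquired.isoformat()),
                "days_apart": (later.acquired - earlier.acquired).days,
                # Assumed question template, for illustration only.
                "question": "Did the construction phase change between the two images?",
                "answer": "yes" if earlier.phase != later.phase else "no",
            })
    return examples


if __name__ == "__main__":
    demo = [
        Observation("site_A", date(2020, 1, 5), "site preparation"),
        Observation("site_A", date(2020, 4, 9), "active construction"),
        Observation("site_A", date(2020, 9, 1), "active construction"),
        Observation("site_A", date(2021, 2, 14), "post construction"),
        Observation("site_B", date(2020, 3, 3), "site preparation"),
        Observation("site_B", date(2020, 8, 20), "active construction"),
    ]
    # 4 observations of site_A give 6 pairs; 2 of site_B give 1 pair.
    print(len(pairwise_examples(demo)))  # 7
```

Under this reading, a site with n observations yields n(n-1)/2 pairs per question template, which is one way a collection of roughly 22 thousand chips could grow into millions of two-image examples. The actual pairing rules and question set used by the authors may differ.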

Problem

Research questions and friction points this paper is trying to address.

geospatial-temporal sensemaking

remote sensing

visual question answering

human activity detection

multimodal large language model

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Large Language Model

Visual Question Answering

Geospatial-Temporal Reasoning