🤖 AI Summary
Existing methods for bi-temporal remote sensing image understanding typically concatenate the two images directly. They therefore do not explicitly model temporal dynamics or spatial-semantic changes, which weakens vision-language alignment. To address this, we propose BTCChat, a multimodal large language model (MLLM) framework designed specifically for change understanding. The approach has three parts: a Change Extraction module that explicitly captures feature differences between the two acquisition times and the spatial patterns of change; a Prompt Augmentation mechanism that adds contextual cues to the prompt to sharpen fine-grained spatial perception; and a joint vision-language alignment pretraining strategy that supports both single-image interpretation and joint reasoning over a bi-temporal pair. Evaluated on change captioning and remote sensing visual question answering (VQA) tasks, our method achieves state-of-the-art performance, improving both the accuracy and the interpretability of its descriptions of temporal change.
📝 Abstract
Bi-temporal satellite imagery supports critical applications such as urban development monitoring and disaster assessment. Although powerful multimodal large language models (MLLMs) have been applied to bi-temporal change analysis, previous methods process image pairs by direct concatenation, which inadequately models temporal correlations and spatial semantic changes. This deficiency hampers visual-semantic alignment in change understanding and limits the overall effectiveness of current approaches. To address this gap, we propose BTCChat, a multi-temporal MLLM with advanced bi-temporal change understanding capability. BTCChat supports bi-temporal change captioning while retaining single-image interpretation capability. To better capture temporal features and spatial semantic changes in image pairs, we design a Change Extraction module. Moreover, to strengthen the model's attention to spatial details, we introduce a Prompt Augmentation mechanism that incorporates contextual clues into the prompt. Experimental results demonstrate that BTCChat achieves state-of-the-art performance on change captioning and visual question answering tasks.
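
The contrast the abstract draws between direct concatenation and explicit change modeling can be illustrated with a small sketch. This is not BTCChat's Change Extraction module, whose design the abstract does not specify; the function names `concat_tokens` and `change_tokens`, and the choice of a per-token difference plus cosine distance, are assumptions made purely for illustration.

```python
import math
from typing import List

Vector = List[float]


def concat_tokens(feats_t1: List[Vector], feats_t2: List[Vector]) -> List[Vector]:
    """Baseline: stack the two token sequences end to end.

    The downstream model receives no explicit signal about what changed.
    """
    return feats_t1 + feats_t2


def change_tokens(feats_t1: List[Vector], feats_t2: List[Vector]) -> List[Vector]:
    """Hypothetical explicit change features (an assumption, not the paper's method).

    For each spatial position, emit the feature difference (t2 - t1)
    followed by the cosine distance between the two feature vectors.
    """
    if len(feats_t1) != len(feats_t2):
        raise ValueError("both images must yield the same number of tokens")
    out = []
    for a, b in zip(feats_t1, feats_t2):
        diff = [y - x for x, y in zip(a, b)]
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        dot = sum(x * y for x, y in zip(a, b))
        # Treat a zero vector as having zero similarity to avoid dividing by zero.
        cos = dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0
        out.append(diff + [1.0 - cos])
    return out


# Two toy "images", each with two tokens of dimension 2.
t1 = [[1.0, 0.0], [0.0, 2.0]]
t2 = [[1.0, 0.0], [2.0, 0.0]]
baseline = concat_tokens(t1, t2)
changes = change_tokens(t1, t2)
```

In this toy input the first token is identical at both times, so its change feature is all zeros, while the second token yields a nonzero difference and a cosine distance of 1. Plain concatenation leaves the model to discover that relationship on its own.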