BTCChat: Advancing Remote Sensing Bi-temporal Change Captioning with Multimodal Large Language Model

📅 2025-09-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing bi-temporal remote sensing image understanding methods typically concatenate image pairs directly, failing to explicitly model temporal dynamics and spatial-semantic changes, thereby undermining visual-language alignment. To address this, we propose a multimodal large language model framework specifically designed for change understanding. Our approach introduces a Change Extraction module that explicitly captures inter-temporal feature discrepancies and spatial change patterns; incorporates a Prompt Augmentation mechanism to integrate contextual cues and enhance fine-grained spatial perception; and adopts a joint visual-language alignment pretraining strategy that supports both single-image parsing and bi-temporal co-reasoning. Evaluated on change captioning and remote sensing visual question answering (VQA) tasks, our method achieves state-of-the-art performance, improving both the accuracy and interpretability of temporal change semantics.

📝 Abstract

Bi-temporal satellite imagery supports critical applications such as urban development monitoring and disaster assessment. Although powerful multimodal large language models (MLLMs) have been applied in bi-temporal change analysis, previous methods process image pairs through direct concatenation, inadequately modeling temporal correlations and spatial semantic changes. This deficiency hampers visual-semantic alignment in change understanding, thereby constraining the overall effectiveness of current approaches. To address this gap, we propose BTCChat, a multi-temporal MLLM with advanced bi-temporal change understanding capability. BTCChat supports bi-temporal change captioning and retains single-image interpretation capability. To better capture temporal features and spatial semantic changes in image pairs, we design a Change Extraction module. Moreover, to enhance the model's attention to spatial details, we introduce a Prompt Augmentation mechanism, which incorporates contextual clues into the prompt to enhance model performance. Experimental results demonstrate that BTCChat achieves state-of-the-art performance on change captioning and visual question answering tasks.
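The abstract does not describe the internals of the Change Extraction module, so the following is only an illustrative sketch of one common way to expose change between two feature sets: a signed per-patch difference plus a cosine distance. The function name `change_features`, the list-of-lists feature layout, and the choice of cosine distance are assumptions for illustration, not BTCChat's actual design.

```python
from math import sqrt


def change_features(feats_t1, feats_t2):
    """Return one (difference_vector, cosine_distance) pair per patch.

    feats_t1, feats_t2: lists of equal-length float vectors, one vector per
    image patch, taken from the earlier and later image respectively.
    This layout is a hypothetical stand-in for real encoder outputs.
    """
    if len(feats_t1) != len(feats_t2):
        raise ValueError("both images must yield the same number of patches")

    descriptors = []
    for a, b in zip(feats_t1, feats_t2):
        # Signed difference: what was added (+) or removed (-) per channel.
        diff = [y - x for x, y in zip(a, b)]

        # Cosine distance: 0.0 for identical direction, 1.0 for orthogonal.
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = sqrt(sum(x * x for x in a))
        norm_b = sqrt(sum(y * y for y in b))
        if norm_a == 0.0 or norm_b == 0.0:
            cos_dist = 1.0  # treat an empty feature as maximally different
        else:
            cos_dist = 1.0 - dot / (norm_a * norm_b)

        descriptors.append((diff, cos_dist))
    return descriptors


if __name__ == "__main__":
    before = [[1.0, 0.0], [0.0, 2.0]]
    after = [[1.0, 0.0], [2.0, 0.0]]
    for diff, dist in change_features(before, after):
        print(diff, round(dist, 3))
```

In contrast to direct concatenation of the two images, a descriptor of this kind hands the language model an explicit change signal per region, which is the motivation the abstract gives for a dedicated module.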

Problem

Research questions and friction points this paper is trying to address.

Improving bi-temporal satellite image change analysis

Enhancing temporal correlation and spatial semantic modeling

Advancing visual-semantic alignment in change captioning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Change Extraction module captures temporal correlations

Prompt Augmentation mechanism enhances spatial details

Multimodal LLM processes bi-temporal satellite imagery
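As a rough illustration of the Prompt Augmentation idea (incorporating contextual clues into the prompt), the sketch below prepends a short context line to a captioning instruction. The function `augment_prompt`, the "Context:" prefix, and the example cues are hypothetical; the source does not specify the actual prompt template or which cues BTCChat uses.

```python
def augment_prompt(instruction, cues):
    """Prepend non-empty contextual cues to an instruction string.

    instruction: the base task prompt sent to the model.
    cues: iterable of short context strings (hypothetical examples:
          acquisition order, scene type). Blank cues are ignored.
    """
    cleaned = [c.strip() for c in cues if c.strip()]
    if not cleaned:
        # Nothing to add: fall back to the unmodified instruction.
        return instruction
    context_line = "Context: " + "; ".join(cleaned) + "."
    return context_line + "\n" + instruction


if __name__ == "__main__":
    prompt = augment_prompt(
        "Describe the changes between the two images.",
        ["the first image is earlier than the second", "focus on buildings and roads"],
    )
    print(prompt)
```

Keeping the cues separate from the instruction makes it easy to compare model outputs with and without augmentation.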
