BTCChat: Advancing Remote Sensing Bi-temporal Change Captioning with Multimodal Large Language Model

📅 2025-09-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing bi-temporal remote sensing image understanding methods typically concatenate image pairs directly. They do not explicitly model temporal dynamics or spatial-semantic changes, which weakens visual-language alignment. To address this, we propose a multimodal large language model framework designed specifically for change understanding. Our approach introduces a Change Extraction module that explicitly captures inter-temporal feature discrepancies and spatial change patterns; incorporates a Prompt Augmentation mechanism to integrate contextual cues and enhance fine-grained spatial perception; and adopts a joint visual-language alignment pretraining strategy that supports both single-image parsing and bi-temporal co-reasoning. Evaluated on change captioning and remote sensing visual question answering (VQA) tasks, our method achieves state-of-the-art performance, significantly improving both the accuracy and interpretability of temporal change semantics.

📝 Abstract
Bi-temporal satellite imagery supports critical applications such as urban development monitoring and disaster assessment. Although powerful multimodal large language models (MLLMs) have been applied to bi-temporal change analysis, previous methods process image pairs through direct concatenation and so model temporal correlations and spatial semantic changes inadequately. This deficiency hampers visual-semantic alignment in change understanding and constrains the overall effectiveness of current approaches. To address this gap, we propose BTCChat, a multi-temporal MLLM with advanced bi-temporal change understanding capability. BTCChat supports bi-temporal change captioning and retains single-image interpretation capability. To better capture temporal features and spatial semantic changes in image pairs, we design a Change Extraction module. Moreover, to sharpen the model's attention to spatial details, we introduce a Prompt Augmentation mechanism, which incorporates contextual clues into the prompt. Experimental results demonstrate that BTCChat achieves state-of-the-art performance on change captioning and visual question answering tasks.
Problem

Research questions and friction points this paper is trying to address.

Improving bi-temporal satellite image change analysis
Enhancing temporal correlation and spatial semantic modeling
Advancing visual-semantic alignment in change captioning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Change Extraction module captures temporal correlations
Prompt Augmentation mechanism enhances spatial details
Multimodal LLM processes bi-temporal satellite imagery
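
The contrast between direct concatenation and explicit change modeling can be illustrated with a toy sketch. The abstract does not describe how the Change Extraction module is implemented, so everything below is an assumption made for illustration: the function names, the per-patch feature lists, the element-wise difference, and the 0.5 threshold are all hypothetical and are not taken from BTCChat.

```python
# Illustrative only: NOT the BTCChat Change Extraction module.
# Features are toy per-patch vectors; all names here are hypothetical.

def concat_features(feat_t1, feat_t2):
    """Baseline: join the two time steps' feature vectors patch by patch."""
    return [a + b for a, b in zip(feat_t1, feat_t2)]

def difference_features(feat_t1, feat_t2):
    """Assumed explicit change cue: element-wise difference per patch."""
    return [[y - x for x, y in zip(a, b)] for a, b in zip(feat_t1, feat_t2)]

def change_magnitude(diff):
    """L1 norm of each patch's difference vector."""
    return [sum(abs(v) for v in patch) for patch in diff]

# Three patches with two feature channels each; only the last patch changes.
feat_t1 = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
feat_t2 = [[1.0, 0.0], [0.5, 0.5], [1.0, 0.0]]

concatenated = concat_features(feat_t1, feat_t2)
diff = difference_features(feat_t1, feat_t2)
magnitudes = change_magnitude(diff)

# Patches whose change magnitude exceeds an arbitrary threshold.
changed = [i for i, m in enumerate(magnitudes) if m > 0.5]
print(changed)
```

With concatenation alone, a downstream model has to infer from the joined vectors which patches differ. The difference features make that signal explicit, which is the general motivation the abstract gives for a dedicated change module.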
Yujie Li
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China
Wenjia Xu
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China
Yuanben Zhang
Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing, China
Zhiwei Wei
School of Geographical Sciences, Hunan Normal University, Hunan Changsha, China
Mugen Peng
Beijing University of Posts and Telecommunications, IEEE Fellow, Web of Science Highly Cited Researcher
Fog Computing, Cloud Radio Access Networks, Integrated Satellite-Terrestrial Networks, 6G