DeltaVLM: Interactive Remote Sensing Image Change Analysis via Instruction-guided Difference Perception

📅 2025-07-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing remote sensing change analysis methods provide only single-shot outputs or static descriptions, failing to support interactive, query-driven, fine-grained interpretation. To address this, we propose a novel instruction-driven paradigm for interactive bitemporal remote sensing image change analysis. We introduce ChangeChat-105k—the first large-scale instruction-tuning dataset for remote sensing change understanding—and design DeltaVLM, an end-to-end vision-language architecture integrating temporal visual encoding, a difference-aware module, a cross-semantic relation measuring (CSRM) mechanism, and an instruction-guided Q-former, while freezing the large language model backbone for efficient adaptation. Our method enables the first multi-turn, instruction-based change reasoning in remote sensing. It achieves state-of-the-art performance on both single-turn descriptive and multi-turn dialog tasks, significantly outperforming existing remote sensing vision-language models and general-purpose multimodal foundation models.

📝 Abstract
Accurate interpretation of land-cover changes in multi-temporal satellite imagery is critical for real-world scenarios. However, existing methods typically provide only one-shot change masks or static captions, limiting their ability to support interactive, query-driven analysis. In this work, we introduce remote sensing image change analysis (RSICA) as a new paradigm that combines the strengths of change detection and visual question answering to enable multi-turn, instruction-guided exploration of changes in bi-temporal remote sensing images. To support this task, we construct ChangeChat-105k, a large-scale instruction-following dataset, generated through a hybrid rule-based and GPT-assisted process, covering six interaction types: change captioning, classification, quantification, localization, open-ended question answering, and multi-turn dialogues. Building on this dataset, we propose DeltaVLM, an end-to-end architecture tailored for interactive RSICA. DeltaVLM features three innovations: (1) a fine-tuned bi-temporal vision encoder to capture temporal differences; (2) a visual difference perception module with a cross-semantic relation measuring (CSRM) mechanism to interpret changes; and (3) an instruction-guided Q-former to effectively extract query-relevant difference information from visual changes, aligning them with textual instructions. We train DeltaVLM on ChangeChat-105k using a frozen large language model, adapting only the vision and alignment modules to optimize efficiency. Extensive experiments and ablation studies demonstrate that DeltaVLM achieves state-of-the-art performance on both single-turn captioning and multi-turn interactive change analysis, outperforming existing multimodal large language models and remote sensing vision-language models. Code, dataset and pre-trained weights are available at https://github.com/hanlinwu/DeltaVLM.
Problem

Research questions and friction points this paper is trying to address.

Existing methods produce only one-shot change masks or static captions for multi-temporal satellite imagery
No support for interactive, query-driven, multi-turn exploration of changes in bi-temporal remote sensing images
No large-scale instruction-tuning dataset exists for remote sensing change understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuned bi-temporal vision encoder for temporal differences
Visual difference perception module with CSRM mechanism
Instruction-guided Q-former for query-relevant difference extraction
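This page does not spell out how the CSRM mechanism operates internally, but the abstract describes it as measuring cross-semantic relations between the two dates to interpret changes. As an illustrative sketch only (not the paper's actual implementation), the general idea of relating bi-temporal features and emphasizing tokens whose semantics diverge can be written as a similarity-gated difference, where `csrm_gate` and its weighting scheme are hypothetical names and choices:

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def csrm_gate(tokens_t1, tokens_t2):
    """Hypothetical sketch of a cross-semantic relation gate.

    For each spatial token, measure semantic agreement between the two
    acquisition dates and gate the raw feature difference: tokens whose
    semantics diverged (low similarity) keep most of their difference
    signal, while unchanged tokens are suppressed.
    """
    gated = []
    for u, v in zip(tokens_t1, tokens_t2):
        diff = [b - a for a, b in zip(u, v)]
        weight = 1.0 - cosine(u, v)  # 0 for identical tokens, larger when semantics diverge
        gated.append([weight * d for d in diff])
    return gated
```

In an architecture like the one described, the gated difference tokens would then be the input that the instruction-guided Q-former queries against the textual instruction; the actual relation measure, normalization, and learned parameters in DeltaVLM may differ.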
Pei Deng
School of Information Science and Technology, Beijing Foreign Studies University, Beijing 100875, China
Wenqian Zhou
School of Information Science and Technology, Beijing Foreign Studies University, Beijing 100875, China
Hanlin Wu
Tsinghua University
Generative Models · AI for Science