Towards Comprehensive Interactive Change Understanding in Remote Sensing: A Large-scale Dataset and Dual-granularity Enhanced VLM

📅 2025-09-27
🤖 AI Summary
Existing remote sensing change understanding datasets lack deep interactive modeling across diverse change-related tasks, including description, classification, counting, and localization. To address this, we propose ChangeIMTI, the first large-scale multi-task instruction-following dataset designed specifically for remote sensing change understanding. We further introduce ChangeVG, a dual-granularity enhanced vision-language model: its vision-guided dual-branch module jointly models fine-grained spatial features and high-level semantics, and the resulting representations serve as auxiliary prompts when instruction-tuning a foundation VLM (e.g., Qwen2.5-VL-7B), enabling cross-modal alignment and joint multi-task reasoning. On change captioning, ChangeVG improves the comprehensive S*m metric by 1.39 points over the strongest baseline, Semantic-CC, and it consistently outperforms prior methods across all four tasks. Ablation studies confirm the contributions of both the dual-granularity perception module and the instruction-driven paradigm.

📝 Abstract
Remote sensing change understanding (RSCU) is essential for analyzing remote sensing images and for understanding how human activities affect the environment. However, existing datasets lack support for deep understanding and interaction across diverse tasks such as change captioning, counting, and localization. To close these gaps, we construct ChangeIMTI, a new large-scale interactive multi-task instruction dataset that encompasses four complementary tasks: change captioning, binary change classification, change counting, and change localization. Building upon this dataset, we further design a novel vision-guided vision-language model (ChangeVG) with dual-granularity awareness for bi-temporal remote sensing images (i.e., two remote sensing images of the same area at different times). The introduced vision-guided module is a dual-branch architecture that synergistically combines fine-grained spatial feature extraction with high-level semantic summarization. These enriched representations serve as auxiliary prompts that guide large vision-language models (VLMs), e.g., Qwen2.5-VL-7B, during instruction tuning, thereby facilitating hierarchical cross-modal learning. We conduct extensive experiments across the four tasks to demonstrate the superiority of our approach. Remarkably, on change captioning, our method outperforms the strongest baseline, Semantic-CC, by 1.39 points on the comprehensive S*m metric, which integrates semantic similarity and descriptive accuracy to provide an overall evaluation of change captions. We also perform a series of ablation studies to examine the critical components of our method.
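The page does not define S*m. In the remote sensing change captioning literature, S*m is commonly reported as the mean of four standard captioning metrics; assuming this paper follows that convention, the score would be

$$
S^{*}_{m} = \frac{1}{4}\left(\text{BLEU-4} + \text{METEOR} + \text{ROUGE}_{L} + \text{CIDEr-D}\right),
$$

so the reported 1.39-point gain would reflect an average improvement across all four metrics.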
Problem

Research questions and friction points this paper is trying to address.

Addresses limited interaction in remote sensing change understanding tasks
Develops dual-granularity model for bi-temporal image analysis
Improves change captioning accuracy through vision-guided VLM enhancement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-granularity vision-language model for bi-temporal images
Vision-guided dual-branch architecture for feature extraction (see the sketch after this list)
Auxiliary prompts enhance hierarchical cross-modal learning
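Neither the summary nor the abstract gives implementation details for the vision-guided module. Below is a minimal PyTorch-style sketch of one way a dual-granularity, dual-branch guide could be realized, with a convolutional fine-grained branch over concatenated bi-temporal features and an attention-pooling semantic branch; all class names, dimensions, and the fusion scheme are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn


class DualGranularityGuide(nn.Module):
    """Hypothetical dual-branch guide: fine-grained change features + semantic summary prompts."""

    def __init__(self, dim: int = 1024, num_prompts: int = 8):
        super().__init__()
        # Fine-grained branch: spatial change cues from concatenated bi-temporal features.
        self.fine = nn.Sequential(
            nn.Conv2d(2 * dim, dim, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),
        )
        # Semantic branch: learnable queries attention-pool the change map into summary tokens.
        self.query = nn.Parameter(torch.randn(num_prompts, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # Project summaries to the VLM's embedding width (assumed equal here).
        self.proj = nn.Linear(dim, dim)

    def forward(self, feat_t1: torch.Tensor, feat_t2: torch.Tensor) -> torch.Tensor:
        # feat_t1 / feat_t2: (B, dim, H, W) backbone features of the two acquisition times.
        fine = self.fine(torch.cat([feat_t1, feat_t2], dim=1))       # (B, dim, H, W)
        tokens = fine.flatten(2).transpose(1, 2)                     # (B, H*W, dim)
        q = self.query.unsqueeze(0).expand(tokens.size(0), -1, -1)   # (B, P, dim)
        summary, _ = self.attn(q, tokens, tokens)                    # (B, P, dim)
        # The summary tokens act as auxiliary prompts for the VLM.
        return self.proj(summary)


# Usage: prompts = DualGranularityGuide()(feat_t1, feat_t2)  # (B, 8, 1024)
```

In this reading, the projected summary tokens would be prepended to the VLM's input sequence (e.g., Qwen2.5-VL-7B) during instruction tuning, which matches the abstract's description of enriched representations serving as auxiliary prompts.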
👥 Authors
Junxiao Xue
Zhejiang Lab
Computer Graphics, Crowd Simulation, Multi-agent Modeling, Multi-modal Learning
Quan Deng
Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Hangzhou, 310024, China
Xuecheng Wu
School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, 710049, China
Kelu Yao
Research Center for Space Computing System, Zhejiang Lab, Hangzhou, 311100, China
Xinyi Yin
School of Cyber Science and Engineering, Zhengzhou University, Zhengzhou, 450002, China
Fei Yu
Research Center for Space Computing System, Zhejiang Lab, Hangzhou, 311100, China
Wei Zhou
School of Computer Science and Informatics, Cardiff University, Cardiff, CF24 4AG, United Kingdom
Yanfei Zhong
Full Professor, RSIDEA, LIESMARS, Wuhan University, China
hyperspectral, high spatial resolution, remote sensing, image processing, computational intelligence
Yang Liu
Department of Computer Science, The University of Toronto, Toronto, ON M5S 1A1, Canada
Dingkang Yang
ByteDance
Multimodal Learning, Generative AI, Embodied AI