AI Summary
Existing remote sensing change detection methods rely predominantly on unimodal visual features and neglect textual semantic guidance, which limits their accuracy and robustness. To address this, we propose LG-CD, a language-guided change detection framework. LG-CD is the first to incorporate SAM2 as a multi-scale visual backbone for this task and introduces two novel components: a Text Fusion Attention Module (TFAM) and a cross-attention-based Vision-Semantic Fusion Decoder (V-SFD), which together enable cross-modal alignment and fine-grained change localization. Multi-layer adapters provide parameter-efficient fine-tuning of the backbone. Experiments on the LEVIR-CD, WHU-CD, and SYSU-CD benchmarks show that LG-CD outperforms state-of-the-art change detection methods. The work also offers new insights into generalized multimodal remote sensing change detection.
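The adapter idea mentioned above can be illustrated with a minimal sketch. It assumes a standard bottleneck adapter (down-projection, nonlinearity, up-projection, residual connection) attached to a frozen backbone layer. The class name `BottleneckAdapter`, the dimensions, and the zero initialization are illustrative assumptions, not details taken from LG-CD.

```python
import numpy as np

class BottleneckAdapter:
    """Hypothetical bottleneck adapter: x + up(relu(down(x))).

    Only these two small matrices would be trained; the backbone
    layer that produced x stays frozen.
    """

    def __init__(self, dim, bottleneck, rng):
        self.w_down = rng.normal(scale=0.02, size=(dim, bottleneck))
        # Zero-initialised up-projection: the adapter starts as an identity map.
        self.w_up = np.zeros((bottleneck, dim))

    def __call__(self, x):
        hidden = np.maximum(x @ self.w_down, 0.0)  # ReLU in the bottleneck
        return x + hidden @ self.w_up              # residual connection

    def num_params(self):
        return self.w_down.size + self.w_up.size


rng = np.random.default_rng(0)
tokens = rng.normal(size=(64, 256))       # assumed: 64 tokens of width 256
adapter = BottleneckAdapter(dim=256, bottleneck=16, rng=rng)
adapted = adapter(tokens)
```

With a bottleneck of 16 on a width of 256, the adapter adds 8,192 weights per layer, compared with 65,536 for a single full 256 x 256 projection, which is the sense in which such fine-tuning is parameter-efficient.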
Abstract
Remote Sensing Change Detection (RSCD) typically identifies changes in land cover or surface conditions by analyzing multi-temporal images. Currently, most deep learning-based methods focus on learning unimodal visual information and neglect the rich semantic information provided by other modalities such as text. To address this limitation, we propose a novel Language-Guided Change Detection model (LG-CD). The model uses natural language prompts to direct the network's attention to regions of interest, improving the accuracy and robustness of change detection. Specifically, LG-CD uses a vision foundation model (SAM2) as a feature extractor to capture multi-scale pyramid features, from high to low resolution, from bi-temporal remote sensing images. Multi-layer adapters are then employed to fine-tune the model for the downstream task of remote sensing change detection. Additionally, we design a Text Fusion Attention Module (TFAM) to align visual and textual information, enabling the model to focus on target change regions using text prompts. Finally, a Vision-Semantic Fusion Decoder (V-SFD) integrates visual and semantic information through a cross-attention mechanism to produce accurate change detection masks. Our experiments on three datasets (LEVIR-CD, WHU-CD, and SYSU-CD) demonstrate that LG-CD consistently outperforms state-of-the-art change detection methods. Furthermore, our approach provides new insights into achieving generalized change detection by leveraging multimodal information.
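The cross-attention fusion of visual and textual information described above can be sketched generically as follows. This is a minimal single-head example under stated assumptions, not the TFAM or V-SFD design itself. It assumes visual tokens act as queries and text tokens as keys and values, and that the visual input is an absolute difference of bi-temporal features. The function names, shapes, and random weights are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(visual, text, w_q, w_k, w_v):
    """Single-head cross-attention.

    visual: (N, d) visual tokens used as queries (assumed).
    text:   (T, d) text-prompt tokens used as keys and values (assumed).
    Returns the fused tokens (N, d) and the attention map (N, T).
    """
    q = visual @ w_q
    k = text @ w_k
    v = text @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # scaled dot-product
    attn = softmax(scores, axis=-1)          # each visual token attends over text
    fused = visual + attn @ v                # residual keeps the visual signal
    return fused, attn


rng = np.random.default_rng(0)
d = 16
feat_t1 = rng.normal(size=(64, d))           # assumed: 8x8 feature map, time 1
feat_t2 = rng.normal(size=(64, d))           # assumed: 8x8 feature map, time 2
text_tokens = rng.normal(size=(5, d))        # assumed: 5 embedded prompt tokens
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

# Assumption: a simple absolute difference stands in for the change features.
diff = np.abs(feat_t1 - feat_t2)
fused, attn = cross_attention(diff, text_tokens, w_q, w_k, w_v)
```

Each row of `attn` is a distribution over the prompt tokens, so a visual location that matches the prompt semantics draws more of its update from the relevant text embedding. A decoder head would then map `fused` back to a spatial change mask.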