LG-CD: Enhancing Language-Guided Change Detection through SAM2 Adaptation

📅 2025-09-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing remote sensing change detection methods rely predominantly on unimodal visual features and neglect textual semantic guidance, which limits their accuracy and robustness. To address this, we propose LG-CD, a language-guided change detection framework. LG-CD uses SAM2 as a multi-scale visual backbone and introduces two components: a Text Fusion Attention Module (TFAM), which aligns visual and textual features, and a cross-attention-based Vision-Semantic Fusion Decoder (V-SFD), which produces the change detection masks. Multi-layer adapters provide parameter-efficient fine-tuning of the model. In experiments on the LEVIR-CD, WHU-CD, and SYSU-CD benchmarks, LG-CD outperforms state-of-the-art change detection methods. The work also offers new insights into generalized change detection with multimodal information.

📝 Abstract

Remote Sensing Change Detection (RSCD) typically identifies changes in land cover or surface conditions by analyzing multi-temporal images. Most current deep learning-based methods focus on learning unimodal visual information and neglect the rich semantic information provided by multimodal data such as text. To address this limitation, we propose a novel Language-Guided Change Detection model (LG-CD). This model leverages natural language prompts to direct the network's attention to regions of interest, significantly improving the accuracy and robustness of change detection. Specifically, LG-CD utilizes a visual foundation model (SAM2) as a feature extractor to capture multi-scale pyramid features, from high to low resolution, from bi-temporal remote sensing images. Multi-layer adapters are then employed to fine-tune the model for the downstream task of remote sensing change detection. Additionally, we design a Text Fusion Attention Module (TFAM) to align visual and textual information, enabling the model to focus on target change regions using text prompts. Finally, a Vision-Semantic Fusion Decoder (V-SFD) deeply integrates visual and semantic information through a cross-attention mechanism to produce highly accurate change detection masks. Our experiments on three datasets (LEVIR-CD, WHU-CD, and SYSU-CD) demonstrate that LG-CD consistently outperforms state-of-the-art change detection methods. Furthermore, our approach provides new insights into achieving generalized change detection by leveraging multimodal information.
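
The cross-attention fusion that the abstract attributes to V-SFD can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name cross_attention, the token counts, the feature width, and the random projection matrices are all assumptions chosen for illustration. Visual tokens act as queries, text tokens act as keys and values, and a residual connection keeps the original visual features.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(visual, text, w_q, w_k, w_v):
    """Fuse text tokens into visual tokens (hypothetical helper).

    visual: (N, d) visual tokens, used as queries.
    text:   (T, d) text tokens, used as keys and values.
    Returns the fused tokens (N, d) and the attention weights (N, T).
    """
    q = visual @ w_q                          # (N, d_k)
    k = text @ w_k                            # (T, d_k)
    v = text @ w_v                            # (T, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # scaled dot-product scores
    attn = softmax(scores, axis=-1)           # each visual token attends over the text
    fused = visual + attn @ v                 # residual keeps the visual content
    return fused, attn


# Toy sizes (assumed): 16 visual tokens, 4 text tokens, width 8.
rng = np.random.default_rng(0)
N, T, d = 16, 4, 8
visual = rng.normal(size=(N, d))
text = rng.normal(size=(T, d))
w_q, w_k, w_v = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))

fused, attn = cross_attention(visual, text, w_q, w_k, w_v)
print(fused.shape, attn.shape)
```

In a real decoder the projections would be learned and applied per head and per pyramid level; the sketch only shows the direction of the attention, from visual queries to text keys and values.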

Problem

Research questions and friction points this paper is trying to address.

Enhancing change detection accuracy using language guidance and visual models

Integrating text prompts to focus on specific change regions in imagery

Aligning multimodal data through vision-semantic fusion for robust detection

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses SAM2 as a multi-scale visual feature extractor

Employs multi-layer adapters to fine-tune the model for remote sensing change detection

Integrates vision-text fusion via cross-attention mechanism
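
The adapter-based fine-tuning listed above can be sketched as a small bottleneck module added to a frozen layer. The summary does not describe the adapter design used in LG-CD, so the bottleneck shape, the ReLU activation, the zero initialization, and the sizes below are assumptions taken from the common bottleneck-adapter pattern, and the class name Adapter is hypothetical.

```python
import numpy as np


class Adapter:
    """Bottleneck adapter sketch: down-project, ReLU, up-project, residual.

    Assumed design, not taken from the paper. Only w_down and w_up would
    be trained; the backbone layer that feeds x stays frozen.
    """

    def __init__(self, dim, bottleneck, rng):
        self.w_down = rng.normal(scale=0.02, size=(dim, bottleneck))
        # Zero-initialized up-projection: the adapter starts as an identity map.
        self.w_up = np.zeros((bottleneck, dim))

    def __call__(self, x):
        h = np.maximum(x @ self.w_down, 0.0)  # ReLU in the bottleneck
        return x + h @ self.w_up              # residual connection

    def num_params(self):
        return self.w_down.size + self.w_up.size


rng = np.random.default_rng(0)
dim, bottleneck = 256, 16                      # assumed sizes
adapter = Adapter(dim, bottleneck, rng)

x = rng.normal(size=(4, dim))                  # 4 tokens from a frozen layer
out = adapter(x)

dense_params = dim * dim                       # a full dim x dim layer, for comparison
print(adapter.num_params(), dense_params)      # 8192 65536
```

With these sizes the adapter holds one eighth of the weights of a single dense layer of the same width, which is the sense in which adapter tuning is parameter-efficient.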
