ViLaCD-R1: A Vision-Language Framework for Semantic Change Detection in Remote Sensing

📅 2025-12-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address key bottlenecks in remote sensing change detection (weak semantic understanding, coarse spatial localization, ambiguous boundaries, and poor model interpretability), this paper proposes ViLaCD-R1, a two-stage vision-language framework. Methodologically, it introduces (1) a collaborative architecture that couples a multi-image reasoning module with a mask-guided decoder; (2) a hybrid training strategy combining supervised fine-tuning and reinforcement learning to enable vision-language model (VLM)-driven, block-level dual-temporal semantic reasoning; and (3) unified semantic-aware representation learning with pixel-accurate localization. Evaluated on multiple standard benchmarks, ViLaCD-R1 achieves state-of-the-art performance, substantially improving true semantic change detection accuracy, robustly suppressing non-semantic disturbances (e.g., illumination or sensor variations), and markedly enhancing boundary precision and model interpretability.
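
The summary describes a two-stage pipeline: the Multi-Image Reasoner (MIR) produces a coarse, block-level change mask from dual-temporal patches, and the Mask-Guided Decoder (MGD) refines it into a pixel-accurate binary map. Below is a minimal PyTorch sketch of how such a pipeline could be wired; the interfaces (the `predict_change` call, the encoder/decoder modules, the block size) are illustrative assumptions, not the paper's released implementation.

```python
# Illustrative sketch only: module interfaces and tensor shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiImageReasoner(nn.Module):
    """Stage 1 (assumed interface): a VLM reasons over dual-temporal image
    blocks and emits a coarse, block-level change mask."""
    def __init__(self, vlm, block_size=64):
        super().__init__()
        self.vlm = vlm              # any multi-image vision-language model
        self.block_size = block_size

    def forward(self, img_t1, img_t2):
        B, _, H, W = img_t1.shape
        bs = self.block_size
        coarse = torch.zeros(B, 1, H // bs, W // bs, device=img_t1.device)
        for i in range(H // bs):
            for j in range(W // bs):
                p1 = img_t1[:, :, i*bs:(i+1)*bs, j*bs:(j+1)*bs]
                p2 = img_t2[:, :, i*bs:(i+1)*bs, j*bs:(j+1)*bs]
                # Hypothetical call: the VLM answers "did a semantic change
                # occur in this block?" for the dual-temporal patch pair.
                coarse[:, 0, i, j] = self.vlm.predict_change(p1, p2)
        return coarse  # block-level coarse change mask

class MaskGuidedDecoder(nn.Module):
    """Stage 2 (assumed interface): fuses dual-temporal features with the
    coarse mask to predict a pixel-accurate binary change map."""
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, img_t1, img_t2, coarse_mask):
        f1 = self.encoder(img_t1)
        f2 = self.encoder(img_t2)
        guide = F.interpolate(coarse_mask, size=f1.shape[-2:], mode="nearest")
        fused = torch.cat([f1, f2, guide], dim=1)   # mask-guided feature fusion
        return torch.sigmoid(self.decoder(fused))   # precise change probability map
```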

📝 Abstract
Remote sensing change detection (RSCD), a complex multi-image inference task, traditionally uses pixel-based operators or encoder-decoder networks that inadequately capture high-level semantics and are vulnerable to non-semantic perturbations. Although recent multimodal and vision-language model (VLM)-based approaches enhance semantic understanding of change regions by incorporating textual descriptions, they still suffer from challenges such as inaccurate spatial localization, imprecise pixel-level boundary delineation, and limited interpretability. To address these issues, we propose ViLaCD-R1, a two-stage framework comprising a Multi-Image Reasoner (MIR) and a Mask-Guided Decoder (MGD). Specifically, the VLM is trained through supervised fine-tuning (SFT) and reinforcement learning (RL) on block-level dual-temporal inference tasks, taking dual-temporal image patches as input and outputting a coarse change mask. Then, the decoder integrates dual-temporal image features with this coarse mask to predict a precise binary change map. Comprehensive evaluations on multiple RSCD benchmarks demonstrate that ViLaCD-R1 substantially improves true semantic change recognition and localization, robustly suppresses non-semantic variations, and achieves state-of-the-art accuracy in complex real-world scenarios.
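
The abstract states that the MIR is trained with supervised fine-tuning followed by reinforcement learning on block-level dual-temporal inference, but does not specify the reward signal. As one plausible, hedged sketch, the RL stage could score agreement (IoU) between the VLM's block-level change predictions and the reference coarse mask:

```python
# Hedged sketch of a possible RL reward; the paper's actual objective is not
# given in this summary, so an IoU-style reward is our assumption.
import torch

def coarse_mask_reward(pred_blocks: torch.Tensor,
                       gt_blocks: torch.Tensor,
                       eps: float = 1e-6) -> torch.Tensor:
    """Per-sample reward in [0, 1]: IoU between predicted and reference
    binary block-level change masks of shape [B, H_b, W_b]."""
    pred = pred_blocks.float()
    gt = gt_blocks.float()
    inter = (pred * gt).sum(dim=(-2, -1))
    union = ((pred + gt) > 0).float().sum(dim=(-2, -1))
    return inter / (union + eps)
```

A policy-gradient optimizer would then maximize this reward over sampled block-level answers from the fine-tuned VLM; the exact RL algorithm and reward used by the paper are not stated here.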
Problem

Research questions and friction points this paper is trying to address.

Existing RSCD methods capture high-level semantics weakly and localize changes only coarsely
Non-semantic variations (e.g., illumination or sensor differences) trigger false detections
Pixel-level boundaries are imprecise and model decisions are hard to interpret
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage framework with Multi-Image Reasoner and Mask-Guided Decoder
VLM fine-tuned via supervised and reinforcement learning on dual-temporal patches
Decoder fuses image features with coarse mask for precise change map
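
Tying these pieces together, a minimal usage example of the sketch given after the AI summary; `my_vlm`, `my_encoder`, and `my_decoder` are hypothetical placeholder modules, and the tensor sizes are arbitrary.

```python
# Placeholder end-to-end usage of the two-stage sketch above.
img_t1 = torch.rand(1, 3, 256, 256)   # pre-change image
img_t2 = torch.rand(1, 3, 256, 256)   # post-change image

mir = MultiImageReasoner(vlm=my_vlm, block_size=64)              # hypothetical VLM
mgd = MaskGuidedDecoder(encoder=my_encoder, decoder=my_decoder)  # hypothetical CNNs

coarse = mir(img_t1, img_t2)                     # e.g. [1, 1, 4, 4] coarse mask
change_map = mgd(img_t1, img_t2, coarse) > 0.5   # pixel-level binary change map
```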
🔎 Similar Papers
No similar papers found.
👥 Authors
Xingwei Ma (Fudan University)
Shiyang Feng (Researcher, AI for Science)
Bo Zhang (Shanghai Artificial Intelligence Laboratory)
Bin Wang (Fudan University)