🤖 AI Summary
To address key bottlenecks in remote sensing change detection—including weak semantic understanding, coarse spatial localization, ambiguous boundaries, and poor model interpretability—this paper proposes ViLaCD-R1, a two-stage vision-language framework. Methodologically, it introduces (1) a novel collaborative architecture integrating a multi-image reasoning module with a mask-guided decoder; (2) a hybrid training strategy combining supervised fine-tuning and reinforcement learning to enable vision-language model (VLM)-driven, block-level dual-temporal semantic reasoning; and (3) unified semantic-aware representation learning with pixel-accurate localization. Evaluated on multiple standard benchmarks, ViLaCD-R1 achieves state-of-the-art performance: it significantly improves accuracy on genuine semantic changes, robustly suppresses non-semantic disturbances (e.g., illumination or sensor variations), and markedly enhances boundary precision and model interpretability.
📝 Abstract
Remote sensing change detection (RSCD), a complex multi-image inference task, has traditionally relied on pixel-based operators or encoder-decoder networks that inadequately capture high-level semantics and are vulnerable to non-semantic perturbations. Although recent multimodal and vision-language model (VLM)-based approaches enhance semantic understanding of change regions by incorporating textual descriptions, they still suffer from inaccurate spatial localization, imprecise pixel-level boundary delineation, and limited interpretability. To address these issues, we propose ViLaCD-R1, a two-stage framework comprising a Multi-Image Reasoner (MIR) and a Mask-Guided Decoder (MGD). Specifically, the MIR's VLM is trained through supervised fine-tuning (SFT) and reinforcement learning (RL) on block-level dual-temporal inference tasks, taking dual-temporal image patches as input and outputting a coarse change mask. The decoder then integrates dual-temporal image features with this coarse mask to predict a precise binary change map. Comprehensive evaluations on multiple RSCD benchmarks demonstrate that ViLaCD-R1 substantially improves recognition and localization of true semantic changes, robustly suppresses non-semantic variations, and achieves state-of-the-art accuracy in complex real-world scenarios.
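The coarse-to-fine interface described above can be illustrated with a minimal numerical sketch. This is not the authors' implementation: the real MIR is a fine-tuned VLM and the real MGD is a learned decoder; here, both are replaced by simple hypothetical difference-thresholding stand-ins (`coarse_block_mask`, `refine_with_decoder` are illustrative names) purely to show how a block-level coarse mask guides a pixel-level binary change map.

```python
import numpy as np

def coarse_block_mask(img_t1, img_t2, block=8, thresh=0.2):
    """Stand-in for the Multi-Image Reasoner (MIR).

    The paper's MIR is a VLM reasoning over dual-temporal patches; this
    proxy merely mimics its interface: flag each block whose mean
    absolute inter-temporal difference exceeds a threshold, yielding a
    coarse (H/block x W/block) change mask.
    """
    h, w = img_t1.shape
    hb, wb = h // block, w // block
    mask = np.zeros((hb, wb), dtype=bool)
    for i in range(hb):
        for j in range(wb):
            a = img_t1[i * block:(i + 1) * block, j * block:(j + 1) * block]
            b = img_t2[i * block:(i + 1) * block, j * block:(j + 1) * block]
            mask[i, j] = np.abs(a - b).mean() > thresh
    return mask

def refine_with_decoder(img_t1, img_t2, coarse, block=8, pix_thresh=0.2):
    """Stand-in for the Mask-Guided Decoder (MGD).

    Upsamples the coarse block mask to pixel resolution and keeps only
    pixels whose own temporal difference is large, producing a sharper
    binary change map restricted to the coarsely flagged regions.
    """
    guide = np.kron(coarse.astype(np.uint8),
                    np.ones((block, block), dtype=np.uint8)).astype(bool)
    return guide & (np.abs(img_t1 - img_t2) > pix_thresh)
```

In the actual framework, the guidance is learned (the decoder fuses dual-temporal features with the mask) rather than a hard logical AND, but the division of labor is the same: the reasoner decides *where* change is semantically plausible at block granularity, and the decoder recovers pixel-accurate boundaries inside those regions.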