ViLaCD-R1: A Vision-Language Framework for Semantic Change Detection in Remote Sensing

📅 2025-12-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address key bottlenecks in remote sensing change detection (weak semantic understanding, coarse spatial localization, ambiguous boundaries, and poor model interpretability), this paper proposes ViLaCD-R1, a two-stage vision-language framework. Methodologically, it introduces (1) a collaborative architecture that couples a multi-image reasoning module with a mask-guided decoder; (2) a hybrid training strategy combining supervised fine-tuning and reinforcement learning to enable vision-language model (VLM)-driven, block-level dual-temporal semantic reasoning; and (3) unified semantic-aware representation learning with pixel-accurate localization. Evaluated on multiple standard benchmarks, ViLaCD-R1 achieves state-of-the-art performance, substantially improving true semantic change detection accuracy, robustly suppressing non-semantic disturbances (e.g., illumination or sensor variations), and markedly enhancing boundary precision and model interpretability.
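
The summary describes a two-stage pipeline: the Multi-Image Reasoner (MIR) produces a coarse, block-level change mask from dual-temporal patches, and the Mask-Guided Decoder (MGD) refines it into a pixel-accurate binary map. Below is a minimal PyTorch sketch of how such a pipeline could be wired; the interfaces (the `predict_change` call, the encoder/decoder modules, the block size) are illustrative assumptions, not the paper's released implementation.

```python
# Illustrative sketch only: module interfaces and tensor shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiImageReasoner(nn.Module):
    """Stage 1 (assumed interface): a VLM reasons over dual-temporal image
    blocks and emits a coarse, block-level change mask."""
    def __init__(self, vlm, block_size=64):
        super().__init__()
        self.vlm = vlm              # any multi-image vision-language model
        self.block_size = block_size

    def forward(self, img_t1, img_t2):
        B, _, H, W = img_t1.shape
        bs = self.block_size
        coarse = torch.zeros(B, 1, H // bs, W // bs, device=img_t1.device)
        for i in range(H // bs):
            for j in range(W // bs):
                p1 = img_t1[:, :, i*bs:(i+1)*bs, j*bs:(j+1)*bs]
                p2 = img_t2[:, :, i*bs:(i+1)*bs, j*bs:(j+1)*bs]
                # Hypothetical call: the VLM answers "did a semantic change
                # occur in this block?" for the dual-temporal patch pair.
                coarse[:, 0, i, j] = self.vlm.predict_change(p1, p2)
        return coarse  # block-level coarse change mask

class MaskGuidedDecoder(nn.Module):
    """Stage 2 (assumed interface): fuses dual-temporal features with the
    coarse mask to predict a pixel-accurate binary change map."""
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, img_t1, img_t2, coarse_mask):
        f1 = self.encoder(img_t1)
        f2 = self.encoder(img_t2)
        guide = F.interpolate(coarse_mask, size=f1.shape[-2:], mode="nearest")
        fused = torch.cat([f1, f2, guide], dim=1)   # mask-guided feature fusion
        return torch.sigmoid(self.decoder(fused))   # precise change probability map
```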

📝 Abstract
Remote sensing change detection (RSCD), a complex multi-image inference task, traditionally uses pixel-based operators or encoder-decoder networks that inadequately capture high-level semantics and are vulnerable to non-semantic perturbations. Although recent multimodal and vision-language model (VLM)-based approaches enhance semantic understanding of change regions by incorporating textual descriptions, they still suffer from challenges such as inaccurate spatial localization, imprecise pixel-level boundary delineation, and limited interpretability. To address these issues, we propose ViLaCD-R1, a two-stage framework comprising a Multi-Image Reasoner (MIR) and a Mask-Guided Decoder (MGD). Specifically, the VLM is trained through supervised fine-tuning (SFT) and reinforcement learning (RL) on block-level dual-temporal inference tasks, taking dual-temporal image patches as input and outputting a coarse change mask. Then, the decoder integrates dual-temporal image features with this coarse mask to predict a precise binary change map. Comprehensive evaluations on multiple RSCD benchmarks demonstrate that ViLaCD-R1 substantially improves true semantic change recognition and localization, robustly suppresses non-semantic variations, and achieves state-of-the-art accuracy in complex real-world scenarios.
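
The abstract states that the MIR is trained with supervised fine-tuning followed by reinforcement learning on block-level dual-temporal inference, but does not specify the reward signal. As one plausible, hedged sketch, the RL stage could score agreement (IoU) between the VLM's block-level change predictions and the reference coarse mask:

```python
# Hedged sketch of a possible RL reward; the paper's actual objective is not
# given in this summary, so an IoU-style reward is our assumption.
import torch

def coarse_mask_reward(pred_blocks: torch.Tensor,
                       gt_blocks: torch.Tensor,
                       eps: float = 1e-6) -> torch.Tensor:
    """Per-sample reward in [0, 1]: IoU between predicted and reference
    binary block-level change masks of shape [B, H_b, W_b]."""
    pred = pred_blocks.float()
    gt = gt_blocks.float()
    inter = (pred * gt).sum(dim=(-2, -1))
    union = ((pred + gt) > 0).float().sum(dim=(-2, -1))
    return inter / (union + eps)
```

A policy-gradient optimizer would then maximize this reward over sampled block-level answers from the fine-tuned VLM; the exact RL algorithm and reward used by the paper are not stated here.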
Problem

Research questions and friction points this paper is trying to address.

Existing RSCD methods capture high-level semantics weakly and localize changes only coarsely
Non-semantic variations (e.g., illumination or sensor differences) trigger false detections
Pixel-level boundaries are imprecise and model decisions are hard to interpret
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage framework with Multi-Image Reasoner and Mask-Guided Decoder
VLM fine-tuned via supervised and reinforcement learning on dual-temporal patches
Decoder fuses image features with coarse mask for precise change map
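
Tying these pieces together, a minimal usage example of the sketch given after the AI summary; `my_vlm`, `my_encoder`, and `my_decoder` are hypothetical placeholder modules, and the tensor sizes are arbitrary.

```python
# Placeholder end-to-end usage of the two-stage sketch above.
img_t1 = torch.rand(1, 3, 256, 256)   # pre-change image
img_t2 = torch.rand(1, 3, 256, 256)   # post-change image

mir = MultiImageReasoner(vlm=my_vlm, block_size=64)              # hypothetical VLM
mgd = MaskGuidedDecoder(encoder=my_encoder, decoder=my_decoder)  # hypothetical CNNs

coarse = mir(img_t1, img_t2)                     # e.g. [1, 1, 4, 4] coarse mask
change_map = mgd(img_t1, img_t2, coarse) > 0.5   # pixel-level binary change map
```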
🔎 Similar Papers
No similar papers found.
👥 Authors
Xingwei Ma (Fudan University)
Shiyang Feng (Researcher, AI for Science)
Bo Zhang (Shanghai Artificial Intelligence Laboratory)
Bin Wang (Fudan University)