π€ AI Summary
This work addresses the limitations of existing remote sensing change detection methods: unimodal approaches are often disrupted by semantically irrelevant visual variations, while multimodal methods suffer from coarse, incomplete, or noisy textual supervision. To overcome these issues, the authors propose the S2M framework, which, for the first time, automatically extracts structured βwhere-what-how-how manyβ quadruplet textual descriptions directly from standard binary change mask labels, providing precise, dense, and noise-free multimodal supervision at zero additional annotation cost. The framework achieves deep alignment between visual and textual features through structured text template generation, a two-stage training strategy, a bidirectional contrastive loss decoder, and domain-adaptive representation learning. Evaluated on the newly introduced Gaza-Change-v2 dataset, S2M achieves a Sek of 17.80% and an F_scd of 66.14%, outperforming multimodal methods that rely on large language models.
π Abstract
Remote sensing change detection is pivotal for urban monitoring, disaster assessment, and environmental resource management. Yet, unimodal deep learning methods frequently confuse genuine semantic changes with visually similar but irrelevant variations. Recent multimodal approaches incorporate text as auxiliary supervision, but their descriptions are either semantically coarse and unstructured or model-generated and thus noisy. Critically, all of them overlook a simple fact: fine-grained change semantics are already implicitly encoded in the ground-truth mask labels that come standard with every change detection dataset. These masks know where the change happened, what the land-cover types were before and after, how the transition occurred, and how many objects were involved. In this paper, we propose S2M, a framework that obtains structured textual features directly from change labels at zero additional annotation cost. Specifically, each change region is automatically transcribed into a semantic quadruple (where, what, how, how many) and converted into several fixed-template text descriptions, providing precise, dense, and noise-free multimodal supervision. We adopts a two-stage training strategy to fine-tune on remote sensing imagery firstly for robust domain-specific representation, after which a multimodal decoder with a bi-directional contrastive loss is introduced to achieve deep alignment between visual features and structured textual embeddings. To validate our method, we construct Gaza-Change-v2, a new multi-class change detection (MCD) dataset about the Gaza Strip. On this MCD dataset, S2M achieves a Sek of 17.80\% and an F$_{\text{scd}}$ of 66.14\%, notably surpassing even multimodal methods that leverage large language models. Our work demonstrates that masks can indeed talk. They tell us exactly what, where, how, and how many changes have occurred.