🤖 AI Summary
This work addresses the challenges of multi-class discrimination, interpretability, and perceptual consistency in remote sensing video change detection. We propose the first end-to-end framework integrating instance-level prior guidance, hierarchical cross-attention diffusion modeling, and pixel-wise multi-class semantic classification. Methodologically: (1) Mask R-CNN extracts temporal instance masks of newly emerged objects as structural priors; (2) a hierarchical cross-attention mechanism guides the denoising process of a Denoising Diffusion Probabilistic Model (DDPM), jointly capturing local object details and global contextual dependencies; (3) an SSIM-based loss is introduced to explicitly enforce perceptual consistency in generated change maps. Evaluated on both synthetic and real-world remote sensing video datasets, our method achieves F1 and IoU improvements of 10–25 percentage points over state-of-the-art baselines—including discriminative methods, Siamese CNNs, and GAN-based approaches—establishing new SOTA performance for multi-class video change detection.
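The hierarchical cross-attention guidance in step (2) can be illustrated with a minimal NumPy sketch. This is an assumption-laden simplification, not the paper's implementation: the function names (`cross_attention`, `hierarchical_cross_attention`), the two-stage local-then-global ordering, and the residual combination are all hypothetical choices made for illustration. The idea shown is that the noisy feature map queries instance-level (object) features first, then global context features, before the DDPM denoising step consumes the result.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v):
    # Scaled dot-product cross-attention: queries q (N, d) attend
    # to keys/values k, v (M, d); returns (N, d).
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ v

def hierarchical_cross_attention(noisy_feats, object_feats, global_feats):
    # Stage 1: attend to instance-level priors (local object detail).
    local_ctx = cross_attention(noisy_feats, object_feats, object_feats)
    # Stage 2: attend to scene-level features (global dependencies),
    # conditioned on the locally enriched queries.
    global_ctx = cross_attention(noisy_feats + local_ctx,
                                 global_feats, global_feats)
    # Residual combination feeds the guided features to the denoiser.
    return noisy_feats + local_ctx + global_ctx
```

In a full model the queries would come from the DDPM's intermediate feature maps at each denoising timestep, and learned projection matrices would map queries, keys, and values into a shared space; both are omitted here for brevity.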
📝 Abstract
We present a unified change detection pipeline that combines instance-level masking, multi-scale attention within a denoising diffusion model, and per-pixel semantic classification, all refined via an SSIM objective to match human perception. By first isolating only temporally novel objects with Mask R-CNN, then guiding diffusion updates through hierarchical cross-attention over object and global contexts, and finally categorizing each pixel into one of C change types, our method delivers detailed, interpretable multi-class maps. It outperforms traditional differencing, Siamese CNNs, and GAN-based detectors by 10–25 points in F1 and IoU on both synthetic and real-world benchmarks, marking a new state of the art in remote sensing change detection.
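The SSIM-based perceptual refinement can be sketched as a loss of the form 1 − SSIM(pred, target). The version below is a simplified global SSIM (single statistics over the whole map, no Gaussian sliding window) with the standard stabilizing constants C1 = (0.01)² and C2 = (0.03)²; the function names and the windowless formulation are illustrative assumptions, not the paper's exact loss.

```python
import numpy as np

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    # Global (windowless) SSIM between two maps with values in [0, 1].
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )

def ssim_loss(pred, target):
    # Perceptual loss: identical maps give 0, dissimilar maps approach 2.
    return 1.0 - ssim(pred, target)
```

In training this term would be added to the per-pixel classification loss, pushing generated change maps toward structural agreement with the reference rather than mere pixel-wise overlap; production implementations compute SSIM over local windows rather than globally.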