DPG-CD: Depth-Prior-Guided Cross-Modal Joint 2D-3D Change Detection

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study targets multi-temporal cross-modal change detection between a pre-event digital surface model (DSM) and post-event optical imagery, where spectral-geometric representation discrepancies create a modality gap and cause false alarms. To bridge this gap, the work introduces an estimated depth prior and a gated cross-modal fusion mechanism that selectively injects geometric cues into image features. It further designs a multi-stage cross-temporal feature interaction module and a multi-task joint decoder that predicts 2D semantic changes and 3D height changes together, with an auxiliary DSM prediction task to improve structural consistency. On Hi-BCD, 3DCD, and the newly introduced NYC-MMCD dataset, the method is reported to outperform state-of-the-art methods in both 2D and 3D change detection.
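One common way to train a multi-task decoder of this kind is a weighted sum of per-task losses. The form below is only an illustrative assumption: the summary does not state the actual objective, and the weights $\lambda_{\text{3D}}$ and $\lambda_{\text{DSM}}$ as well as the individual loss terms are hypothetical placeholders.

```latex
% Hypothetical joint objective (assumed, not taken from the paper):
%   L_2D  : loss on the 2D semantic change map
%   L_3D  : loss on the 3D height change map
%   L_DSM : loss on the auxiliary DSM prediction
\mathcal{L}
  = \mathcal{L}_{\text{2D}}
  + \lambda_{\text{3D}}\,\mathcal{L}_{\text{3D}}
  + \lambda_{\text{DSM}}\,\mathcal{L}_{\text{DSM}}
```

Under this assumed form, setting $\lambda_{\text{DSM}} = 0$ would disable the auxiliary DSM task, which is how its contribution to structural consistency could be ablated.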

📝 Abstract

Urban spatial evolution is manifested not only through horizontal expansion but also through vertical structural changes. Consequently, jointly capturing 2D semantic changes and 3D height changes is essential for urban morphology analysis and emergency management. In practical scenarios, collecting 3D observations is often constrained by high acquisition costs and the inability to support frequent updates. A multi-temporal cross-modal input consisting of a pre-event Digital Surface Model (DSM) and post-event imagery provides a practical solution for 3D change detection in high-frequency urban monitoring, disaster assessment, and emergency response scenarios. However, this setting remains challenging because imagery and DSM data exhibit significant spectral-geometric representation gaps. Moreover, modality differences may be confused with actual changes, and robust change detection requires effective fusion of semantic and geometric features from multi-temporal data. In this paper, we propose DPG-CD, a depth-prior-guided multi-temporal cross-modal fusion framework for joint 2D semantic and 3D height change detection. Specifically, an estimated depth prior is introduced into the imagery to mitigate the modality gap with the DSM. A gated fusion mechanism then selectively injects geometric cues from the depth prior while preserving discriminative spectral representations. Subsequently, a multi-stage cross-temporal cross-modal feature fusion architecture is employed to extract change-aware features. Finally, a multi-task decoder jointly predicts 2D semantic changes and 3D height changes, complemented by an auxiliary DSM prediction task to improve structural consistency and height estimation accuracy. Experiments on two public datasets, Hi-BCD and 3DCD, and a new dataset, NYC-MMCD, demonstrate that DPG-CD outperforms state-of-the-art methods on both 2D and 3D change detection tasks.
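To make the idea of gated fusion concrete, here is a minimal sketch in plain Python. It is not the DPG-CD implementation: the abstract does not give the gate's formulation, so the residual form `image + gate * depth`, the function name `gated_fusion`, and the scalar parameters `w_img`, `w_depth`, and `bias` are all assumptions chosen for illustration. In a real network the gate would typically be produced by learned convolutional layers over feature maps.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(img_feat, depth_feat, w_img=1.0, w_depth=1.0, bias=0.0):
    """Hypothetical element-wise gated fusion (assumed form, not the paper's).

    For each element:
        gate  = sigmoid(w_img * img + w_depth * depth + bias)
        fused = img + gate * depth
    The image feature always passes through unchanged (preserving spectral
    information), while the gate decides how much depth cue is injected.
    """
    if len(img_feat) != len(depth_feat):
        raise ValueError("feature vectors must have the same length")

    fused, gates = [], []
    for f_img, f_depth in zip(img_feat, depth_feat):
        gate = sigmoid(w_img * f_img + w_depth * f_depth + bias)
        gates.append(gate)
        fused.append(f_img + gate * f_depth)
    return fused, gates


if __name__ == "__main__":
    # Toy features: two elements from an image branch and a depth-prior branch.
    fused, gates = gated_fusion([1.0, 2.0], [4.0, -2.0])
    print("gates:", gates)
    print("fused:", fused)
```

With a strongly negative `bias` the gate closes and the output falls back to the image features alone, which is the sense in which such a gate can "selectively" inject geometric cues.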

Problem

Research questions and friction points this paper is trying to address.

cross-modal

2D-3D change detection

urban morphology

modality gap

multi-temporal

Innovation

Methods, ideas, or system contributions that make the work stand out.

depth prior

cross-modal fusion

2D-3D change detection