🤖 AI Summary
Persistent thick cloud cover causes extensive spatial gaps in urban land surface temperature (LST) remote sensing imagery, while existing reconstruction methods either rely on multi-temporal/multi-source data or lack robustness under severe cloud contamination. Method: We propose a purely spatial, single-epoch LST reconstruction framework that introduces conditional denoising diffusion models for the first time in LST imputation. The model integrates static geospatial priors—including built-up area maps and digital elevation models—and incorporates a pixel-level supervised fine-tuning mechanism to strictly preserve consistency with observed cloud-free pixels. A U-Net backbone and synthetic cloud masking augmentation further enhance generalization. Results: Under 85% cloud coverage, our method achieves SSIM = 0.89, RMSE = 1.2 K, and R² = 0.84—significantly outperforming conventional interpolation methods in robustness and accuracy, thereby demonstrating strong capability for reconstructing large contiguous cloud-obscured regions.
📝 Abstract
Satellite-derived Land Surface Temperature (LST) products are central to surface urban heat island (SUHI) monitoring due to their consistent grid-based coverage over large metropolitan regions. However, cloud contamination frequently obscures LST observations, limiting their usability for continuous SUHI analysis. Most existing LST reconstruction methods rely on multitemporal information or multisensor data fusion, requiring auxiliary observations that may be unavailable or unreliable under persistent cloud cover. Purely spatial gap-filling approaches offer an alternative, but traditional statistical methods degrade under large or spatially contiguous gaps, while many deep-learning-based spatial models deteriorate rapidly with increasing missingness.
Recent advances in denoising-diffusion-based image inpainting have demonstrated improved robustness under high missingness, motivating their adoption for spatial LST reconstruction. In this work, we introduce UrbanDIFF, a purely spatial denoising diffusion model for reconstructing cloud-contaminated urban LST imagery. The model is conditioned on static urban structure information, including built-up surface data and a digital elevation model, and enforces strict consistency with observed cloud-free pixels through a supervised pixel-guided refinement step during inference.
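The consistency mechanism described above can be illustrated with a standard diffusion-inpainting idea: at each reverse step, pixels under the cloud mask come from the denoising model, while observed cloud-free pixels are re-noised to the current noise level and copied back in. The sketch below is illustrative only, assuming a hypothetical `denoise_fn` and a DDPM-style `alpha_bar_t` schedule; it is not the paper's actual implementation.

```python
import numpy as np

def diffusion_inpaint_step(x_t, lst_obs, cloud_mask, alpha_bar_t, denoise_fn, rng):
    """One reverse-diffusion step with observed-pixel consistency (sketch).

    x_t        : current noisy LST image at step t
    lst_obs    : observed LST image (valid where cloud_mask is False)
    cloud_mask : boolean array, True where pixels are cloud-obscured
    alpha_bar_t: cumulative noise-schedule coefficient at step t (assumed DDPM-style)
    denoise_fn : hypothetical model call proposing the next (less noisy) image
    """
    # Model proposes the next image for the whole frame, including gaps.
    x_prev = denoise_fn(x_t)
    # Re-noise the observed cloud-free pixels to the current noise level
    # so they are statistically compatible with x_prev.
    noise = rng.standard_normal(lst_obs.shape)
    x_obs_t = np.sqrt(alpha_bar_t) * lst_obs + np.sqrt(1.0 - alpha_bar_t) * noise
    # Masked (cloudy) regions come from the model; cloud-free pixels are
    # pinned to the (noised) observation, enforcing consistency.
    return np.where(cloud_mask, x_prev, x_obs_t)
```

Iterating this step to t = 0 yields a reconstruction that agrees exactly with the cloud-free observations while the model fills the gaps.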
UrbanDIFF is trained and evaluated on NASA MODIS Terra LST data from seven major United States metropolitan areas spanning 2002 to 2025. Experiments using synthetic cloud masks with 20 to 85 percent coverage show that UrbanDIFF consistently outperforms an interpolation baseline, particularly under dense cloud occlusion, achieving SSIM of 0.89, RMSE of 1.2 K, and R² of 0.84 at 85 percent cloud coverage, while exhibiting slower performance degradation as cloud density increases.
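The evaluation protocol above relies on synthetic cloud masks at controlled coverage levels. One simple way to generate spatially contiguous (cloud-like, rather than salt-and-pepper) masks is to threshold a smoothed random field at the quantile matching the target coverage. This is a minimal sketch of that idea, not the paper's mask-generation procedure; the function name and smoothing scheme are assumptions.

```python
import numpy as np

def synthetic_cloud_mask(shape, coverage, n_smooth=8, seed=None):
    """Boolean cloud mask with spatially contiguous blobs (sketch).

    shape    : (H, W) of the mask
    coverage : target fraction of cloud-obscured pixels, e.g. 0.85
    n_smooth : smoothing passes; more passes -> larger, smoother blobs
    """
    rng = np.random.default_rng(seed)
    field = rng.standard_normal(shape)
    for _ in range(n_smooth):
        # 5-point box smoothing with edge padding correlates nearby pixels,
        # turning white noise into contiguous cloud-like structures.
        p = np.pad(field, 1, mode="edge")
        field = (p[1:-1, 1:-1] + p[:-2, 1:-1] + p[2:, 1:-1]
                 + p[1:-1, :-2] + p[1:-1, 2:]) / 5.0
    # Threshold at the (1 - coverage) quantile so that roughly a `coverage`
    # fraction of pixels exceeds it and is marked as cloud.
    thresh = np.quantile(field, 1.0 - coverage)
    return field >= thresh
```

Sweeping `coverage` from 0.20 to 0.85 then reproduces the kind of missingness range the experiments describe, with high coverages producing the large contiguous gaps that stress purely spatial methods most.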