🤖 AI Summary
Existing diffusion-based satellite image generation methods suffer from insufficient use of environmental context, poor robustness to missing or corrupted data, and difficulty aligning with user intent. To address these challenges, we propose an environment-aware multimodal diffusion generation framework that, for the first time, explicitly incorporates dynamic environmental conditions (e.g., weather, illumination) as controllable conditioning signals. We further design a metadata fusion strategy that jointly models synergistic interactions among textual descriptions, structured metadata, and reference visual features, enabling stable generation when parts of the input are absent. Evaluated on both single-image and time-series satellite image synthesis tasks, our method outperforms state-of-the-art approaches across six quantitative metrics, achieving significant improvements in image fidelity, land-cover accuracy, and environmental consistency while demonstrating superior robustness to metadata incompleteness.
📝 Abstract
Diffusion-based foundation models have recently garnered much attention in the field of generative modeling due to their ability to generate images of high quality and fidelity. Although not straightforward, their recent application to remote sensing marked the first successful steps towards harnessing the large volume of publicly available datasets containing multimodal information. Despite their success, existing methods face considerable limitations: they rely on limited environmental context, struggle with missing or corrupted data, and often fail to reliably reflect user intentions in generated outputs. In this work, we propose a novel diffusion model conditioned on environmental context that is able to generate satellite images conditioned on any combination of three different control signals: a) text, b) metadata, and c) visual data. In contrast to previous works, the proposed method i) is, to our knowledge, the first of its kind to condition satellite image generation on dynamic environmental conditions as part of its control signals, and ii) incorporates a metadata fusion strategy that models attribute embedding interactions to account for partially corrupt and/or missing observations. Our method outperforms previous methods both qualitatively (robustness to missing metadata, higher responsiveness to control inputs) and quantitatively (higher fidelity, accuracy, and quality of generations, measured using six different metrics) on both single-image and temporal generation tasks. The reported results support our hypothesis that conditioning on environmental context can improve the performance of foundation models for satellite imagery, and render our model a promising candidate for use in downstream tasks. The collected three-modal dataset is, to our knowledge, the first publicly available dataset to combine data from these three different mediums.
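The metadata fusion idea above can be illustrated with a minimal sketch: each environmental attribute gets an embedding, missing or corrupted attributes fall back to a learned null token, and a simple attention-style step lets attribute embeddings interact before being pooled into a fixed-size conditioning vector. The attribute names, binning, and mixing mechanism here are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # embedding dimension (illustrative)

# Hypothetical environmental attributes; names are assumptions, not from the paper.
ATTRS = ["cloud_cover", "sun_elevation", "month"]

# One embedding table per attribute (here 10 discrete bins each), plus a
# learned "null" token that stands in for missing/corrupted observations.
embed = {a: rng.normal(size=(10, D)) for a in ATTRS}
null_token = {a: rng.normal(size=D) for a in ATTRS}

def fuse_metadata(observed: dict) -> np.ndarray:
    """Build a conditioning vector from whichever attributes are present.

    Missing attributes use their null token, so the conditioning signal
    keeps a fixed shape regardless of which inputs are available.
    """
    tokens = np.stack([
        embed[a][observed[a]] if a in observed else null_token[a]
        for a in ATTRS
    ])  # (num_attrs, D)

    # Self-attention-style mixing so attribute embeddings can interact.
    scores = tokens @ tokens.T / np.sqrt(D)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return (weights @ tokens).mean(axis=0)  # (D,) pooled conditioning vector

full = fuse_metadata({"cloud_cover": 2, "sun_elevation": 5, "month": 7})
partial = fuse_metadata({"month": 7})  # two attributes missing
assert full.shape == partial.shape == (D,)
```

The key property is that `partial` is a valid conditioning vector of the same shape as `full`, so the diffusion backbone never sees a hole in its input; in the paper this robustness is what allows stable generation under metadata incompleteness.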