AI Summary
Satellite-to-street-view image synthesis faces significant challenges due to cross-view and cross-modal discrepancies in appearance and geometry. This work presents a systematic survey of state-of-the-art methods and identifies three critical gaps: (1) insufficient adaptation to modern architectures such as Transformers and diffusion models; (2) absence of large-scale, multi-city, multi-season, semantically annotated, multi-source public datasets; and (3) lack of scene-aware, task-specific evaluation metrics. To address these, we propose a unified generative framework integrating GANs, diffusion models, and VAEs, featuring multi-scale feature alignment and explicit geometric prior modeling. Experiments reveal that existing approaches suffer from outdated network designs, yielding limited detail fidelity and scene diversity. Our framework establishes a reproducible technical pathway and benchmarking infrastructure for high-fidelity, generalizable street-view synthesis.
Abstract
In recent years, street-view imagery has become one of the most important sources for geospatial data collection and urban analytics, supporting meaningful insights and decision-making. Synthesizing a street-view image from its corresponding satellite image is challenging because of the significant differences in appearance and viewpoint between the two domains. In this study, we screened 20 recent research papers to provide a thorough review of state-of-the-art methods for synthesizing street-view images from their satellite counterparts. The main findings are: (i) novel deep learning techniques are required to synthesize more realistic and accurate street-view images; (ii) more datasets need to be collected and made publicly available; and (iii) more specific evaluation metrics need to be investigated to evaluate the generated images appropriately. We conclude that, because it applies outdated deep learning techniques, the recent literature has failed to generate detailed and diverse street-view images.