🤖 AI Summary
This work addresses the cross-view synthesis problem from street-level to aerial imagery, where significant viewpoint discrepancies cause domain shift and dense urban occlusions severely limit visibility. To tackle these challenges, we propose a Curved-BEV perspective transformation mechanism coupled with a "multi-to-one" geometric mapping strategy, enabling a BEV-guided conditional diffusion model that jointly ensures layout consistency, geometric plausibility, and photorealistic texture synthesis. We further introduce Ground2Aerial-3 (G2A-3), the first multi-scenario benchmark dataset dedicated to ground-to-aerial synthesis. Extensive evaluations on CVUSA, CVACT, VIGOR-Chicago, and G2A-3 demonstrate state-of-the-art performance, with substantial improvements in both content consistency and visual fidelity of generated aerial images over prior methods.
📝 Abstract
Ground-to-aerial image synthesis aims to generate realistic aerial images from corresponding ground street-view images while maintaining a consistent content layout, simulating a top-down view. The significant viewpoint difference creates a domain gap between views, and dense urban scenes limit the visible range of street views, making this cross-view generation task particularly challenging. In this paper, we introduce SkyDiffusion, a novel cross-view generation method for synthesizing aerial images from street-view images, built on a diffusion model and the Bird's-Eye View (BEV) paradigm. The Curved-BEV method in SkyDiffusion converts street-view images into a BEV perspective, effectively bridging the domain gap, and employs a "multi-to-one" mapping strategy to address occlusion in dense urban scenes. SkyDiffusion then uses a BEV-guided diffusion model to generate content-consistent and realistic aerial images. Additionally, we introduce a novel dataset, Ground2Aerial-3, designed for diverse ground-to-aerial image synthesis applications, including disaster-scene aerial synthesis, low-altitude UAV image synthesis, and historical high-resolution satellite image synthesis. Experimental results demonstrate that SkyDiffusion outperforms state-of-the-art methods on cross-view datasets spanning natural (CVUSA), suburban (CVACT), urban (VIGOR-Chicago), and application (G2A-3) scenarios, achieving realistic and content-consistent aerial image generation. Code, datasets, and further information are available at https://opendatalab.github.io/skydiffusion/.
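The Curved-BEV transform itself is not specified in this abstract, but the general idea behind any street-view-to-BEV step — re-projecting ground-plane points into the camera so the image can be resampled into a top-down layout — can be sketched with a standard flat-plane inverse perspective mapping. This is an illustrative sketch only, not SkyDiffusion's method; the camera height and intrinsics below are made-up example values:

```python
import numpy as np

def ipm_lookup(bev_xy, cam_height=1.6, fx=500.0, fy=500.0, cx=320.0, cy=240.0):
    """Project ground-plane points (x = metres forward, y = metres left, z = 0)
    into a pinhole camera mounted `cam_height` metres above the ground,
    looking straight ahead with no tilt.

    Returns an (N, 2) array of (u, v) pixel coordinates -- the lookup table
    one would use to resample a street-view image into a BEV grid.
    All parameters here are illustrative assumptions, not values from the paper.
    """
    bev_xy = np.asarray(bev_xy, dtype=float)
    x, y = bev_xy[:, 0], bev_xy[:, 1]
    # Camera frame convention: X_cam = right, Y_cam = down, Z_cam = forward.
    Xc = -y                              # "left" in the world is "-right" in camera
    Yc = np.full_like(x, cam_height)     # ground plane sits cam_height below camera
    Zc = x                               # forward distance
    u = fx * Xc / Zc + cx                # pinhole projection
    v = fy * Yc / Zc + cy
    return np.stack([u, v], axis=1)
```

A point straight ahead of the camera projects onto the vertical centre line (u = cx), and farther ground points land higher in the image (smaller v), which is exactly the geometry a BEV resampling grid encodes. A curved, rather than flat, ground model — as the name Curved-BEV suggests — would replace the constant `Yc` with a distance-dependent surface, extending the usable mapping range.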