🤖 AI Summary
This work addresses cross-view image generation from aerial to ground-level perspectives, where extreme viewpoint differences, severe occlusions, and a limited field of view lead to geometric distortions and semantic inconsistencies. We propose a height-aware, dual-conditioned diffusion model that uses VAE-encoded spatial features (derived from the aerial RGB image and an estimated height map) and CLIP semantic embeddings as complementary constraints on the denoising process, without relying on intermediate representations such as depth maps or 3D voxels. To our knowledge, this is the first approach to combine these heterogeneous features to jointly improve geometric plausibility and semantic fidelity. The method generates ground-view images directly from aerial inputs and reports an average SSIM improvement of 7.3% across the CVUSA, CVACT, and Auto Arborist benchmarks. It handles both wide and narrow fields of view, which indicates strong generalization and robustness.
📝 Abstract
Generating ground-level images from aerial views is challenging because of extreme viewpoint disparity, occlusions, and a limited field of view. We introduce Top2Ground, a novel diffusion-based method that directly generates photorealistic ground-view images from aerial input images without relying on intermediate representations such as depth maps or 3D voxels. Specifically, we condition the denoising process on a joint representation of VAE-encoded spatial features (derived from aerial RGB images and an estimated height map) and CLIP-based semantic embeddings. This design keeps the generation both geometrically constrained by the scene's 3D structure and semantically consistent with its content. We evaluate Top2Ground on three diverse datasets: CVUSA, CVACT, and Auto Arborist. Top2Ground improves SSIM by 7.3% on average across these datasets and robustly handles both wide and narrow fields of view, highlighting its strong generalization capabilities.
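The dual conditioning described above can be illustrated with a toy sketch. The abstract does not say how the two conditions enter the denoiser, so this example assumes two mechanisms that are common in latent diffusion models: channel-wise concatenation for the spatial features and cross-attention for the semantic embeddings. All function names, shapes, and values are hypothetical, the attention has no learned projections, and the tiny nested lists stand in for real VAE latents and CLIP tokens.

```python
import math

def concat_channels(*latents):
    """Concatenate [C][H][W] nested lists along the channel axis."""
    height, width = len(latents[0][0]), len(latents[0][0][0])
    stacked = []
    for latent in latents:
        for channel in latent:
            # All inputs must share the same spatial resolution.
            assert len(channel) == height and len(channel[0]) == width
            stacked.append(channel)
    return stacked

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(query, tokens):
    """Single-head scaled dot-product attention (assumed, no learned weights)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, tok)) / scale for tok in tokens]
    weights = softmax(scores)
    return [sum(w * tok[d] for w, tok in zip(weights, tokens))
            for d in range(len(query))]

def dual_condition(noisy_latent, aerial_latent, height_latent, semantic_tokens):
    """Mix spatial conditions (concatenation) with semantic ones (attention)."""
    stacked = concat_channels(noisy_latent, aerial_latent, height_latent)
    channels, height, width = len(stacked), len(stacked[0]), len(stacked[0][0])
    output = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for y in range(height):
        for x in range(width):
            pixel = [stacked[c][y][x] for c in range(channels)]
            attended = cross_attend(pixel, semantic_tokens)
            for c in range(channels):
                # Residual connection: spatial feature plus attended semantics.
                output[c][y][x] = pixel[c] + attended[c]
    return output

# Toy inputs: one channel each at 2x2 resolution, two 3-d semantic tokens.
noisy = [[[0.1, -0.2], [0.3, 0.0]]]    # stand-in for the noisy latent
aerial = [[[0.5, 0.5], [0.1, 0.9]]]    # stand-in for the aerial RGB latent
height = [[[0.0, 1.0], [2.0, 0.5]]]    # stand-in for the height-map latent
tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

features = dual_condition(noisy, aerial, height, tokens)
print(len(features), len(features[0]), len(features[0][0]))  # 3 2 2
```

In a real model the concatenated tensor would pass through a learned denoising network, and the queries, keys, and values would each have their own learned projection. The sketch only shows where each kind of condition could plug in.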