Top2Ground: A Height-Aware Dual Conditioning Diffusion Model for Robust Aerial-to-Ground View Generation

📅 2025-11-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses cross-view image generation from aerial to ground-level perspectives, tackling the geometric distortions and semantic inconsistencies that arise from extreme viewpoint discrepancies, severe occlusions, and field-of-view limitations. The authors propose a height-aware dual-conditioning diffusion model that jointly leverages spatial features encoded by a VAE and semantic embeddings from CLIP as complementary constraints in the denoising process, eliminating the need for explicit depth maps or 3D voxel representations. To their knowledge, this is the first approach to integrate such heterogeneous features for joint optimization of geometric plausibility and semantic fidelity. The method performs end-to-end cross-view synthesis and achieves an average 7.3% SSIM improvement over prior methods on the CVUSA, CVACT, and Auto Arborist benchmarks. It notably improves generation quality under both wide and narrow field-of-view conditions, demonstrating strong generalization and robustness.

📝 Abstract
Generating ground-level images from aerial views is a challenging task due to extreme viewpoint disparity, occlusions, and a limited field of view. We introduce Top2Ground, a novel diffusion-based method that directly generates photorealistic ground-view images from aerial input images without relying on intermediate representations such as depth maps or 3D voxels. Specifically, we condition the denoising process on a joint representation of VAE-encoded spatial features (derived from aerial RGB images and an estimated height map) and CLIP-based semantic embeddings. This design ensures that the generation is both geometrically constrained by the scene's 3D structure and semantically consistent with its content. We evaluate Top2Ground on three diverse datasets: CVUSA, CVACT, and Auto Arborist. Our approach achieves a 7.3% average improvement in SSIM across the three benchmarks, demonstrating that Top2Ground robustly handles both wide and narrow fields of view and generalizes well.
Problem

Research questions and friction points this paper is trying to address.

Generating ground-level images from aerial views with extreme viewpoint differences
Overcoming occlusions and limited field of view in aerial-to-ground conversion
Creating geometrically accurate and semantically consistent ground-view images
Innovation

Methods, ideas, or system contributions that make the work stand out.

Direct aerial-to-ground image generation using a diffusion model
Joint conditioning with VAE spatial features and CLIP embeddings
Height-aware geometric constraints for 3D structure consistency
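The paper's code is not reproduced here; as a rough illustration of the dual-conditioning idea, the sketch below fuses a VAE-style spatial latent with a global CLIP-style semantic embedding by tiling the embedding over the spatial grid and concatenating along the channel axis. All shapes are assumed for illustration, and the fusion strategy is a simple stand-in; the authors' actual conditioning mechanism may differ.

```python
import numpy as np

def joint_condition(vae_latent, clip_embed):
    """Build a joint conditioning tensor from a spatial latent and a
    global embedding (illustrative stand-in for the paper's fusion).

    vae_latent: (C, H, W) spatial features from a VAE encoder
    clip_embed: (D,) global semantic embedding from CLIP
    returns:    (C + D, H, W) channel-wise concatenation
    """
    c, h, w = vae_latent.shape
    d = clip_embed.shape[0]
    # Tile the global embedding so every spatial location sees the
    # same semantic context, then stack it onto the spatial channels.
    semantic = np.broadcast_to(clip_embed[:, None, None], (d, h, w))
    return np.concatenate([vae_latent, semantic], axis=0)

# Toy shapes: a 4-channel latent on an 8x8 grid, 16-dim embedding.
latent = np.random.randn(4, 8, 8)
embed = np.random.randn(16)
cond = joint_condition(latent, embed)
print(cond.shape)  # (20, 8, 8)
```

In a denoising network, a tensor like `cond` would typically be injected at each U-Net step, so that spatial (geometric) and semantic constraints guide generation jointly.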