AI Summary
To address severe occlusion, incomplete geometry, high memory overhead, and poor edge-device deployability in large-scale urban reconstruction from aerial imagery, this paper proposes CityGo, a hybrid representation that combines textured proxy building meshes with residual and surrounding 3D Gaussians. The method extracts compact proxy meshes from multi-view stereo (MVS) point clouds, textures them using zero-order spherical harmonic (SH) Gaussians with image-based rendering and back-projection, and adds depth-guided residual Gaussians where the proxy disagrees with the photographs. Importance-aware downsampling of surrounding Gaussians and joint optimization of proxy textures and Gaussian parameters keep the representation lightweight enough for mobile GPUs. Evaluated on real-world aerial datasets, the approach achieves an average 1.4× training speedup with visual fidelity comparable to pure 3D Gaussian Splatting (3DGS), while substantially reducing memory and energy consumption. It renders complex urban scenes in real time on consumer-grade mobile GPUs, addressing 3DGS's dense primitive usage, long training times, and poor suitability for edge devices.
Abstract
Accurate and efficient modeling of large-scale urban scenes is critical for applications such as AR navigation, UAV-based inspection, and smart city digital twins. While aerial imagery offers broad coverage and compensates for the limitations of ground-based data, reconstructing city-scale environments from such views remains challenging due to occlusions, incomplete geometry, and high memory demands. Recent advances such as 3D Gaussian Splatting (3DGS) improve scalability and visual quality but remain limited by dense primitive usage, long training times, and poor suitability for edge devices. We propose CityGo, a hybrid framework that combines textured proxy geometry with residual and surrounding 3D Gaussians for lightweight, photorealistic rendering of urban scenes from aerial perspectives. Our approach first extracts compact building proxy meshes from multi-view stereo (MVS) point clouds, then uses zero-order spherical harmonic (SH) Gaussians to generate occlusion-free textures via image-based rendering and back-projection. To capture high-frequency details, we introduce residual Gaussians placed based on proxy-photo discrepancies and guided by depth priors. Broader urban context is represented by surrounding Gaussians, with importance-aware downsampling applied to non-critical regions to reduce redundancy. A tailored optimization strategy jointly refines proxy textures and Gaussian parameters, enabling real-time rendering of complex urban scenes on mobile GPUs with significantly reduced training and memory requirements. Extensive experiments on real-world aerial datasets demonstrate that our hybrid representation significantly reduces training time, achieving a 1.4× speedup on average, while delivering visual fidelity comparable to pure 3D Gaussian Splatting approaches. Furthermore, CityGo enables real-time rendering of large-scale urban scenes on consumer mobile GPUs, with substantially reduced memory usage and energy consumption.
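
The abstract says only that residual Gaussians are placed according to proxy-photo discrepancies and guided by depth priors; it gives no further detail. The sketch below is one plausible reading of that step, not CityGo's actual implementation. It assumes a grayscale render of the textured proxy, the matching photograph, a per-pixel depth prior, and a pinhole camera. The function name `residual_seed_points`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the `threshold` value are all hypothetical.

```python
from typing import List, Tuple

# Grayscale image stored as rows of floats in [0, 1] (an assumption for this sketch).
Image = List[List[float]]


def residual_seed_points(
    proxy_render: Image,
    photo: Image,
    depth: Image,
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    threshold: float = 0.1,
) -> List[Tuple[float, float, float]]:
    """Return camera-space 3D points for pixels where the proxy render
    differs from the photo by more than `threshold`.

    Each selected pixel (u, v) is back-projected with its depth prior z
    through a pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    """
    seeds: List[Tuple[float, float, float]] = []
    rows = zip(proxy_render, photo, depth)
    for v, (render_row, photo_row, depth_row) in enumerate(rows):
        for u, (r, p, z) in enumerate(zip(render_row, photo_row, depth_row)):
            if z <= 0.0:
                # No valid depth prior for this pixel, so it cannot be placed in 3D.
                continue
            if abs(r - p) > threshold:
                # Large photometric error: the proxy alone misses detail here.
                x = (u - cx) * z / fx
                y = (v - cy) * z / fy
                seeds.append((x, y, z))
    return seeds
```

In a pipeline of this kind, the returned points would serve as initial centers for additional Gaussians, so that new primitives are spent only where the proxy mesh fails to explain the photograph. The remaining Gaussian parameters (scale, opacity, color) would still have to be initialized and optimized separately.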