🤖 AI Summary
Existing neural rendering methods excel only at interpolating novel views near trained camera trajectories in urban scene reconstruction, exhibiting poor generalization to out-of-distribution viewpoints (e.g., top-down or side views), thus limiting practical deployment. To address this, we introduce extrapolative view synthesis (EVS)—a new task requiring robust rendering from radically novel perspectives. We propose a hierarchical semantic-geometric prior framework that jointly leverages voxel-level scene modeling and instance-level 3D bounding box priors from urban object detection. Our HSG-VSD algorithm implements variational score distillation under joint semantic and geometric constraints, integrating occupancy grids, pre-trained UrbanCraft2D guidance, and neural rendering. Evaluated on a dedicated EVS benchmark, our method surpasses state-of-the-art approaches by +12.6 dB PSNR. Qualitative results demonstrate strong generalization to large viewpoint deviations and challenging conditions (e.g., text blurring), while preserving high-fidelity geometric and appearance details.
📝 Abstract
Existing neural rendering-based urban scene reconstruction methods mainly focus on the Interpolated View Synthesis (IVS) setting, which synthesizes views close to the training camera trajectory. However, IVS cannot guarantee on-par performance for novel views outside the training camera distribution (*e.g.*, looking left, right, or downwards), which limits the generalizability of urban reconstruction applications. Previous methods have addressed this via image diffusion, but they fail to handle text-ambiguous content or large unseen view angles due to the coarse-grained control of text-only diffusion. In this paper, we design UrbanCraft, which surmounts the Extrapolated View Synthesis (EVS) problem using hierarchical semantic-geometric representations as additional priors. Specifically, we leverage the partially observable scene to reconstruct coarse semantic and geometric primitives, establishing a coarse scene-level prior with an occupancy grid as the base representation. Additionally, we incorporate fine instance-level priors from 3D bounding boxes to enhance object-level details and spatial relationships. Building on this, we propose **H**ierarchical **S**emantic-Geometric-**G**uided Variational Score Distillation (HSG-VSD), which integrates semantic and geometric constraints from the pretrained UrbanCraft2D into the score distillation sampling process, forcing the distilled distribution to be consistent with the observable scene. Qualitative and quantitative comparisons demonstrate the effectiveness of our method on the EVS problem.
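The core mechanism the abstract describes, distilling a pretrained conditional diffusion model into a scene representation by conditioning its noise prediction on semantic and geometric priors, can be sketched as below. This is a minimal toy illustration of a score-distillation gradient step, not the paper's HSG-VSD implementation: `toy_denoiser` is a stand-in for a model like UrbanCraft2D, and the fused condition vector, weighting `w(t)`, and step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_denoiser(x_t, t, cond):
    # Stand-in for a pretrained conditional diffusion model
    # (here a trivial linear map; in the paper this would be
    # UrbanCraft2D conditioned on semantic/geometric priors).
    return 0.9 * x_t + 0.1 * cond

def sds_gradient(x, t, cond, alpha_bar):
    """One score-distillation step: noise the rendering x, query the
    conditional denoiser for its noise estimate, and use the residual
    (eps_pred - eps) as a gradient signal on x."""
    eps = rng.standard_normal(x.shape)
    x_t = np.sqrt(alpha_bar) * x + np.sqrt(1.0 - alpha_bar) * eps
    eps_pred = toy_denoiser(x_t, t, cond)
    w = 1.0 - alpha_bar  # a common timestep weighting choice
    return w * (eps_pred - eps)

# Toy "rendering" and a fused condition vector standing in for
# occupancy-grid (scene-level) plus 3D-box (instance-level) features.
x = rng.standard_normal(8)
cond = rng.standard_normal(8)
g = sds_gradient(x, t=500, cond=cond, alpha_bar=0.3)
x_new = x - 0.1 * g  # gradient step pulls x toward the guided distribution
print(x_new.shape)
```

The key point the sketch conveys is that the priors enter only through the denoiser's conditioning input, so the distilled distribution is steered toward renderings consistent with the observed scene rather than with a text prompt alone.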