🤖 AI Summary
To address the degraded generalization performance in large-scale outdoor novel view synthesis (NVS) caused by sparse training views, this work pioneers the extension of feed-forward PixelNeRF to the UrbanScene3D dataset. We propose Aug3D, a 3D data augmentation method that leverages SfM reconstruction to jointly perform mesh-based and semantics-guided sampling, generating novel views with high parallax coverage and geometric consistency—thereby alleviating the bottleneck of insufficient view overlap in the original data. Furthermore, we introduce a joint fine-tuning strategy integrating multi-scale view clustering with Aug3D-driven data augmentation. Experiments demonstrate a 10% PSNR improvement under halved per-cluster view count (20→10), and Aug3D significantly enhances rendering fidelity, establishing a new state-of-the-art for outdoor-scene NVS.
📝 Abstract
Recent photorealistic Novel View Synthesis (NVS) advances have increasingly gained attention. However, these approaches remain constrained to small indoor scenes. While optimization-based NVS models have attempted to address this, generalizable feed-forward methods, offering significant advantages, remain underexplored. In this work, we train PixelNeRF, a feed-forward NVS model, on the large-scale UrbanScene3D dataset. We propose four training strategies to cluster and train on this dataset, highlighting that performance is hindered by limited view overlap. To address this, we introduce Aug3D, an augmentation technique that leverages reconstructed scenes using traditional Structure-from-Motion (SfM). Aug3D generates well-conditioned novel views through grid and semantic sampling to enhance feed-forward NVS model learning. Our experiments reveal that reducing the number of views per cluster from 20 to 10 improves PSNR by 10%, but the performance remains suboptimal. Aug3D further addresses this by combining the newly generated novel views with the original dataset, demonstrating its effectiveness in improving the model's ability to predict novel views.