🤖 AI Summary
Reconstructing outdoor 3D scenes from sparse, unposed internet images is difficult because of complex lighting conditions and transient occlusions, and existing approaches typically rely on per-scene optimization, which limits their generalization. This work proposes GenWildSplat, the first feed-forward framework that requires no test-time optimization. It leverages geometric priors to directly predict depth, camera parameters, and 3D Gaussians in a canonical space, and it incorporates an appearance adapter and semantic segmentation to handle illumination variations and transient objects. Evaluated on PhotoTourism and MegaScenes, GenWildSplat achieves state-of-the-art feed-forward rendering quality, runs inference in real time, and generalizes across diverse scenes, lighting conditions, and occlusion patterns.
📝 Abstract
Reconstructing 3D scenes from sparse, unposed images remains challenging under real-world conditions with varying illumination and transient occlusions. Existing methods rely on scene-specific optimization using appearance embeddings or dynamic masks, which requires extensive per-scene training and fails under sparse views. Moreover, evaluations on limited scenes raise questions about generalization. We present GenWildSplat, a feed-forward framework for sparse-view outdoor reconstruction that requires no per-scene optimization. Given unposed internet images, GenWildSplat predicts depth, camera parameters, and 3D Gaussians in a canonical space using learned geometric priors. An appearance adapter modulates appearance for target lighting conditions, while semantic segmentation handles transient objects. Through curriculum learning on synthetic and real data, GenWildSplat generalizes across diverse illumination and occlusion patterns. Evaluations on the PhotoTourism and MegaScenes benchmarks demonstrate state-of-the-art feed-forward rendering quality, achieving real-time inference without test-time optimization.
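
The abstract names two mechanisms, an appearance adapter and segmentation-based handling of transient objects, without specifying their form. The sketch below is one plausible reading, not the authors' implementation. It assumes the adapter can be approximated by a per-channel affine (scale and shift) modulation of canonical Gaussian colors, and that transient pixels are simply excluded from an L1 photometric error. The function names `modulate_colors` and `masked_l1` are hypothetical.

```python
from typing import List, Sequence, Tuple

RGB = Tuple[float, float, float]


def modulate_colors(base: Sequence[RGB], scale: RGB, shift: RGB) -> List[RGB]:
    """Apply a per-channel affine modulation to canonical colors.

    Assumption: `scale` and `shift` stand in for whatever an appearance
    adapter would predict for a target lighting condition.
    Results are clamped to [0, 1].
    """
    out: List[RGB] = []
    for color in base:
        out.append(tuple(
            min(1.0, max(0.0, c * s + b))
            for c, s, b in zip(color, scale, shift)
        ))
    return out


def masked_l1(rendered: Sequence[RGB], target: Sequence[RGB],
              transient: Sequence[bool]) -> float:
    """Mean absolute error over pixels that are not flagged as transient.

    Assumption: `transient` is a per-pixel boolean mask derived from a
    semantic segmentation (True = pedestrian, car, or other transient).
    Returns 0.0 if every pixel is masked out.
    """
    total, count = 0.0, 0
    for r, t, is_transient in zip(rendered, target, transient):
        if is_transient:
            continue  # transient pixels contribute nothing to the error
        total += sum(abs(a - b) for a, b in zip(r, t))
        count += 3
    return total / count if count else 0.0


# Usage: two pixels, the first covered by a transient object.
rendered = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
target = [(1.0, 1.0, 1.0), (0.0, 0.0, 0.0)]
loss_masked = masked_l1(rendered, target, [True, False])
loss_unmasked = masked_l1(rendered, target, [False, False])
warm = modulate_colors([(0.5, 0.5, 0.5)], (1.2, 1.0, 0.8), (0.0, 0.0, 0.1))
```

In this toy case the occluded pixel dominates the unmasked error, while the masked error ignores it. This illustrates why excluding transient regions keeps occluders from being baked into a static reconstruction.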