🤖 AI Summary
This work addresses the challenging problem of Neural Radiance Field (NeRF) scene reconstruction from only 3–6 unposed images, particularly under sparse features, large baselines, or severely limited viewpoints. To remove the reliance on conventional calibration targets, the authors propose using common everyday objects as "pose probes." Methodologically, they first segment the probe object with SAM, then estimate initial camera poses using signed distance function (SDF)-guided modeling and Perspective-n-Point (PnP) matching. Next, they introduce a dual-branch NeRF architecture that separately models the probe object and the background scene, and jointly optimizes geometry, appearance, and camera poses. Finally, incremental multi-view pose refinement eliminates the dependence on COLMAP and dense feature matching. Experiments demonstrate state-of-the-art performance in both pose estimation and novel-view synthesis under extreme view scarcity and large baselines, significantly outperforming feature-matching-based pipelines, and the method remains robust and consistent across diverse probe objects.
📝 Abstract
Radiance fields, including NeRFs and 3D Gaussians, demonstrate great potential in high-fidelity rendering and scene reconstruction, yet they require a substantial number of posed images as input. COLMAP is frequently employed in preprocessing to estimate poses, but it needs a large number of feature matches to operate effectively and struggles with scenes characterized by sparse features, large baselines between images, or a limited number of input images. We aim to tackle few-view NeRF reconstruction using only 3 to 6 unposed scene images. Traditional methods often rely on calibration boards, which rarely appear in everyday images. We propose the novel idea of utilizing everyday objects, commonly found both in images and in real life, as "pose probes". The probe object is automatically segmented by SAM, and its shape is initialized from a cube. We apply a dual-branch volume-rendering optimization (object NeRF and scene NeRF) to constrain the pose optimization and jointly refine the geometry. Specifically, object poses of two views are first estimated by PnP matching in an SDF representation, which serves as the initialization. PnP matching, requiring only a few features, is well suited to feature-sparse scenes. Additional views are incrementally incorporated to refine poses from preceding views. In experiments, PoseProbe achieves state-of-the-art performance in both pose estimation and novel view synthesis across multiple datasets, and is particularly effective in few-view and large-baseline scenes where COLMAP struggles. In ablations, using different probe objects in a scene yields comparable performance. Our project page is available at https://zhirui-gao.github.io/PoseProbe.github.io/
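To make the PnP initialization step concrete: the core operation is recovering a camera pose [R|t] from a handful of 3D–2D correspondences on the probe object. Below is a minimal, self-contained sketch of that operation using a linear DLT solver in NumPy; this is an illustrative stand-in, not the paper's SDF-guided implementation, and the calibration matrix and point counts are assumed for the example.

```python
import numpy as np

def dlt_pnp(K, pts3d, pts2d):
    """Estimate a camera pose (R, t) from n >= 6 3D-2D correspondences
    via the Direct Linear Transform -- a linear stand-in for the PnP
    matching step described in the abstract."""
    n = len(pts3d)
    A = np.zeros((2 * n, 12))
    for i, (X, x) in enumerate(zip(pts3d, pts2d)):
        Xh = np.append(X, 1.0)          # homogeneous 3D point
        u, v = x
        # Two equations per point from the cross product x × (P Xh) = 0
        A[2 * i,     4:8]  = -Xh
        A[2 * i,     8:12] =  v * Xh
        A[2 * i + 1, 0:4]  =  Xh
        A[2 * i + 1, 8:12] = -u * Xh
    # The projection matrix (up to scale) is the null vector of A
    _, _, Vt = np.linalg.svd(A)
    P = Vt[-1].reshape(3, 4)
    M = np.linalg.inv(K) @ P            # M ~ [R | t] up to scale and sign
    # Fix scale and sign so the rotation part has determinant +1
    M /= np.cbrt(np.linalg.det(M[:, :3]))
    # Project the 3x3 block onto the nearest rotation matrix
    U, _, Vt2 = np.linalg.svd(M[:, :3])
    R = U @ Vt2
    t = M[:, 3]
    return R, t
```

With noise-free correspondences the recovered pose matches the ground truth to numerical precision; in practice one would follow this with the nonlinear joint refinement the paper performs inside the dual-branch NeRF optimization.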