🤖 AI Summary
To address the challenge of scene-level 3D reconstruction from sparse, uncalibrated images, this paper proposes a coarse-to-fine framework that jointly optimizes a 3D Gaussian Splatting (3DGS) scene representation and the camera poses. The method has three stages: (i) geometric initialization with a fast multi-view stereo (MVS) model, (ii) confidence-aware depth alignment against a monocular depth estimate, and (iii) warped-image-guided inpainting that supplies supervision at novel viewpoints. By combining monocular depth estimation with image warping, the method improves geometric consistency and texture fidelity under sparse-view conditions. Evaluated on novel view synthesis and camera pose estimation, it reports state-of-the-art reconstruction quality (PSNR/SSIM) and pose accuracy while remaining efficient, and it requires neither known camera parameters nor densely captured inputs.
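To make the idea of confidence-aware depth alignment concrete, here is a minimal NumPy sketch. It is not the paper's implementation: it assumes the alignment is a single global scale and shift applied to the monocular depth, fitted by weighted least squares on pixels whose confidence exceeds a threshold. The function name `align_mono_depth`, the `conf_threshold` parameter, and the synthetic data are all hypothetical.

```python
import numpy as np


def align_mono_depth(coarse_depth, mono_depth, confidence, conf_threshold=0.5):
    """Fit scale s and shift t so that s * mono_depth + t matches coarse_depth
    on confident pixels, then apply the fit to the whole monocular depth map.

    Assumption (not from the paper): a global affine model and a hard
    confidence threshold are sufficient for illustration.
    """
    mask = confidence > conf_threshold
    if mask.sum() < 2:
        raise ValueError("need at least two confident pixels to fit scale and shift")

    x = mono_depth[mask].astype(np.float64)
    y = coarse_depth[mask].astype(np.float64)
    w = confidence[mask].astype(np.float64)

    # Weighted least squares: multiply each row of the system by sqrt(weight).
    sw = np.sqrt(w)
    A = np.stack([x, np.ones_like(x)], axis=1) * sw[:, None]
    b = y * sw
    (scale, shift), *_ = np.linalg.lstsq(A, b, rcond=None)

    return scale * mono_depth + shift, scale, shift


# Synthetic example: the coarse depth is an affine function of the mono depth,
# except for two rows that are corrupted and marked as low confidence.
rng = np.random.default_rng(0)
mono = rng.uniform(0.1, 1.0, size=(8, 8))
coarse = 2.0 * mono + 0.5
confidence = np.full((8, 8), 0.9)
coarse[:2] += 5.0        # unreliable coarse depth
confidence[:2] = 0.1     # flagged as not confident

aligned, scale, shift = align_mono_depth(coarse, mono, confidence)
```

Because the corrupted rows fall below the threshold, they do not influence the fit, and the aligned map replaces them with values consistent with the confident region.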
📝 Abstract
Photo-realistic scene reconstruction from sparse-view, uncalibrated images is in high demand in practice. Despite recent progress, existing methods either handle sparse views but require accurate camera parameters (i.e., intrinsics and extrinsics), or are SfM-free but need densely captured images. To combine the advantages of both families of methods while addressing their respective weaknesses, we propose Dust to Tower (D2T), an accurate and efficient coarse-to-fine framework that simultaneously optimizes a 3D Gaussian Splatting (3DGS) model and camera poses from sparse, uncalibrated images. Our key idea is to first construct a coarse model efficiently and then refine it using warped and inpainted images at novel viewpoints. To this end, we first introduce a Coarse Construction Module (CCM), which exploits a fast Multi-View Stereo model to initialize the 3DGS model and recover initial camera poses. To refine the 3D model at novel viewpoints, we propose a Confidence Aware Depth Alignment (CADA) module that refines the coarse depth maps by aligning their confident parts with depths estimated by a monocular depth model. We then propose a Warped Image-Guided Inpainting (WIGI) module that warps the training images to novel viewpoints using the refined depth maps and applies inpainting to fill the "holes" in the warped images caused by view-direction changes, providing high-quality supervision to further optimize the 3D model and the camera poses. Extensive experiments and ablation studies validate D2T and its design choices: it achieves state-of-the-art performance on both novel view synthesis and pose estimation while remaining highly efficient. Code will be made publicly available.
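The "holes" that WIGI has to inpaint can be illustrated with a small depth-based forward warp. The sketch below is an assumption-laden illustration, not the authors' code: it uses a pinhole camera, nearest-pixel splatting with a z-buffer, and a hypothetical function `forward_warp` whose pose arguments `R` and `t` map source-camera coordinates to target-camera coordinates. The inpainting model itself is not included; the sketch only produces the mask it would need to fill.

```python
import numpy as np


def forward_warp(image, depth, K, R, t):
    """Warp `image` (H x W x C) to a target view using per-pixel `depth` (H x W).

    K is the 3x3 pinhole intrinsic matrix; X_target = R @ X_source + t.
    Returns the warped image and a boolean mask of target pixels that
    received no source pixel (the holes an inpainting step would fill).
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u.ravel(), v.ravel(), np.ones(h * w)], axis=0)  # 3 x N

    # Back-project source pixels to 3D, then move them into the target frame.
    pts_src = (np.linalg.inv(K) @ pix) * depth.ravel()
    pts_tgt = R @ pts_src + np.asarray(t, dtype=np.float64).reshape(3, 1)

    z = pts_tgt[2]
    valid = z > 1e-6                      # keep points in front of the camera
    z_safe = np.where(valid, z, 1.0)      # avoid division by zero
    proj = K @ pts_tgt
    u_t = np.rint(proj[0] / z_safe).astype(int)
    v_t = np.rint(proj[1] / z_safe).astype(int)
    valid &= (u_t >= 0) & (u_t < w) & (v_t >= 0) & (v_t < h)

    warped = np.zeros_like(image)
    zbuf = np.full((h, w), np.inf)
    colors = image.reshape(-1, image.shape[-1])
    for i in np.flatnonzero(valid):
        # Z-buffer test: the nearest surface wins when pixels collide.
        if z[i] < zbuf[v_t[i], u_t[i]]:
            zbuf[v_t[i], u_t[i]] = z[i]
            warped[v_t[i], u_t[i]] = colors[i]

    holes = np.isinf(zbuf)
    return warped, holes


# Toy example: a fronto-parallel plane at depth 2 and a sideways camera shift.
rng = np.random.default_rng(0)
image = rng.uniform(size=(8, 8, 3))
depth = np.full((8, 8), 2.0)
K = np.array([[8.0, 0.0, 4.0],
              [0.0, 8.0, 4.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.5, 0.0, 0.0])  # pixel shift = fx * tx / depth = 2 columns

warped, holes = forward_warp(image, depth, K, R, t)
```

In this toy setup every source pixel moves two columns to the right, so the two leftmost target columns receive no data and are reported in `holes`. A differentiable variant would typically replace the nearest-pixel splat with bilinear sampling, which this sketch leaves out for brevity.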