🤖 AI Summary
To address the challenge of reconstructing high-fidelity, wide-field-of-view 3D scenes with strong geometric and photometric consistency from a single image—critical for immersive AR/VR applications—this paper proposes a two-stage reconstruction framework. First, a multimodal diffusion model generates a globally consistent, high-fidelity panoramic image. Second, panoramic depth estimation is jointly optimized with 3D Gaussian Splatting, guided by video diffusion priors and constrained by camera trajectory optimization to ensure geometric and color consistency. The key contribution is the first integration of multimodal diffusion generation, panoramic depth estimation, and 2D video-diffusion-guided Gaussian Splatting refinement, enabling joint geometry-color denoising. Experiments demonstrate significant improvements over state-of-the-art methods across multiple metrics; the approach supports free-viewpoint rendering and high-quality immersive experiences using only a single input image.
📝 Abstract
The increasing demand for augmented and virtual reality applications has highlighted the importance of crafting immersive 3D scenes from a single-view image. However, because a single-view input provides only partial priors, existing methods are often limited to reconstructing low-consistency 3D scenes with narrow fields of view, making them less capable of generalizing to immersive scene reconstruction. To address this problem, we propose ExScene, a two-stage pipeline to reconstruct an immersive 3D scene from any given single-view image. ExScene designs a novel multimodal diffusion model to generate a high-fidelity and globally consistent panoramic image. We then develop a panoramic depth estimation approach to calculate geometric information from the panorama, and we combine this geometric information with the high-fidelity panoramic image to train an initial 3D Gaussian Splatting (3DGS) model. Following this, we introduce a GS refinement technique with 2D stable video diffusion priors. We add camera trajectory consistency and color-geometric priors into the diffusion denoising process to improve color and spatial consistency across image sequences. These refined sequences are then used to fine-tune the initial 3DGS model, leading to better reconstruction quality. Experimental results demonstrate that ExScene achieves consistent and immersive scene reconstruction using only single-view input, significantly surpassing state-of-the-art baselines.
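The two-stage pipeline described above can be outlined in code. The following is a minimal, heavily mocked sketch of the data flow only; every function name, tensor shape, and stage body here is an illustrative assumption, not the paper's implementation (the actual stages are learned diffusion, depth, and 3DGS models):

```python
# Hypothetical sketch of the ExScene two-stage data flow. All names and
# shapes are illustrative assumptions; each stage is mocked with trivial
# array operations standing in for the learned models.
import numpy as np

def generate_panorama(single_view: np.ndarray) -> np.ndarray:
    """Stage 1 (mocked): a multimodal diffusion model would outpaint the
    single view into a globally consistent equirectangular panorama."""
    h, w, c = single_view.shape
    pano = np.zeros((h, w * 4, c))   # a 360-degree panorama is much wider
    pano[:, :w] = single_view        # the known region comes from the input
    return pano

def estimate_panoramic_depth(pano: np.ndarray) -> np.ndarray:
    """Mocked panoramic depth estimation: one depth value per pixel."""
    return np.ones(pano.shape[:2])

def train_initial_3dgs(pano: np.ndarray, depth: np.ndarray) -> dict:
    """Mocked 3DGS initialization: lift each panorama pixel (with its
    depth) to a Gaussian primitive; here just counted, not optimized."""
    h, w = depth.shape
    return {"num_gaussians": h * w, "colors": pano.reshape(-1, pano.shape[2])}

def refine_with_video_diffusion(gs_model: dict, num_views: int = 8) -> dict:
    """Stage 2 (mocked): render views along an optimized camera trajectory,
    denoise them with video diffusion priors (plus trajectory-consistency
    and color-geometric constraints), then fine-tune the Gaussians."""
    gs_model["refined_views"] = num_views
    return gs_model

# End-to-end: single view -> panorama -> depth -> initial 3DGS -> refined 3DGS
view = np.random.rand(16, 16, 3)
pano = generate_panorama(view)
depth = estimate_panoramic_depth(pano)
scene = refine_with_video_diffusion(train_initial_3dgs(pano, depth))
```

The sketch only fixes the interfaces between stages; in the paper each arrow is a trained model, and the refinement stage feeds its denoised image sequences back into 3DGS fine-tuning.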