ExScene: Free-View 3D Scene Reconstruction with Gaussian Splatting from a Single Image

📅 2025-03-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of reconstructing high-fidelity, wide-field-of-view 3D scenes with strong geometric and photometric consistency from a single image—critical for immersive AR/VR applications—this paper proposes a two-stage reconstruction framework. First, a multimodal diffusion model generates globally consistent, high-fidelity panoramic images. Second, panoramic depth estimation is jointly optimized with 3D Gaussian Splatting, guided by video diffusion priors and constrained by camera trajectory optimization to ensure geometric and color consistency. Our key contribution is the first integration of multimodal diffusion generation, panoramic depth estimation, and 2D video-diffusion-guided Gaussian Splatting refinement, enabling joint geometry-color denoising. Experiments demonstrate significant improvements over state-of-the-art methods across multiple metrics; our approach supports free-viewpoint rendering and high-quality immersive experiences using only a single input image.

📝 Abstract
The increasing demand for augmented and virtual reality applications has highlighted the importance of crafting immersive 3D scenes from a single-view image. However, due to the partial priors provided by single-view input, existing methods are often limited to reconstructing low-consistency 3D scenes with narrow fields of view, making them less capable of generalizing to immersive scene reconstruction. To address this problem, we propose ExScene, a two-stage pipeline that reconstructs an immersive 3D scene from any given single-view image. ExScene designs a novel multimodal diffusion model to generate a high-fidelity and globally consistent panoramic image. We then develop a panoramic depth estimation approach to calculate geometric information from the panorama, and we combine this geometric information with the high-fidelity panoramic image to train an initial 3D Gaussian Splatting (3DGS) model. Following this, we introduce a GS refinement technique with 2D stable video diffusion priors: we add camera trajectory consistency and color-geometric priors into the denoising process of the diffusion model to improve color and spatial consistency across image sequences. These refined sequences are then used to fine-tune the initial 3DGS model, leading to better reconstruction quality. Experimental results demonstrate that ExScene achieves consistent and immersive scene reconstruction using only single-view input, significantly surpassing state-of-the-art baselines.
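The two-stage pipeline in the abstract can be sketched end to end. The following is a minimal toy sketch using NumPy stand-ins for each stage: every function name, array shape, and the stubbed logic (tiling for panorama generation, luminance for depth, equirectangular back-projection for Gaussian seeding) are illustrative assumptions, not the authors' implementation — the real pipeline uses trained diffusion and depth models.

```python
import numpy as np

# Hypothetical stand-ins for ExScene's components; all names, shapes,
# and logic are illustrative assumptions, not the authors' code.

def generate_panorama(image: np.ndarray) -> np.ndarray:
    """Stage 1a (stub): a multimodal diffusion model would outpaint the
    single view into a globally consistent equirectangular panorama."""
    pano = np.tile(image, (1, 4, 1)).astype(np.float32)  # widen FoV to ~360 deg
    noise = np.random.default_rng(0).normal(0.0, 0.01, pano.shape)
    return np.clip(pano + noise, 0.0, 1.0)

def estimate_panoramic_depth(pano: np.ndarray) -> np.ndarray:
    """Stage 1b (stub): per-pixel depth for the panorama."""
    return 1.0 + pano.mean(axis=-1)  # strictly positive depths

def init_gaussians(pano: np.ndarray, depth: np.ndarray) -> dict:
    """Stage 1c: back-project panorama pixels into 3D points that seed
    the initial 3DGS model (positions + colors only, for illustration)."""
    h, w = depth.shape
    lon = np.linspace(-np.pi, np.pi, w)        # azimuth per column
    lat = np.linspace(np.pi / 2, -np.pi / 2, h)  # elevation per row
    lon, lat = np.meshgrid(lon, lat)           # shape (h, w) each
    xyz = np.stack([depth * np.cos(lat) * np.cos(lon),
                    depth * np.cos(lat) * np.sin(lon),
                    depth * np.sin(lat)], axis=-1)
    return {"positions": xyz.reshape(-1, 3), "colors": pano.reshape(-1, 3)}

def refine_with_video_prior(gaussians: dict, n_views: int = 8):
    """Stage 2 (stub): render a camera trajectory, denoise the rendered
    sequence with a video diffusion prior, and fine-tune the Gaussians."""
    trajectory = [t / n_views * 2 * np.pi for t in range(n_views)]
    return gaussians, trajectory  # fine-tuning omitted in this sketch

view = np.random.default_rng(1).random((32, 64, 3)).astype(np.float32)
pano = generate_panorama(view)
depth = estimate_panoramic_depth(pano)
gs = init_gaussians(pano, depth)
gs, traj = refine_with_video_prior(gs)
```

The sketch only fixes the data flow: single view to panorama, panorama plus depth to an initial point set, then a trajectory for refinement.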
Problem

Research questions and friction points this paper is trying to address.

Reconstruct immersive 3D scenes from single-view images
Overcome low-consistency and narrow field limitations
Enhance color and spatial consistency in reconstructions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal diffusion model for panoramic image
Panoramic depth estimation for geometry
GS refinement with video diffusion priors
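The last point, refining rendered sequences with a video diffusion prior under consistency constraints, can be illustrated with a toy guided-denoising loop: at each step the diffusion update is nudged toward frames consistent with the current 3DGS renderings. The denoiser, the guidance term, and all parameters below are assumptions for illustration; the paper uses a pretrained stable video diffusion model with camera trajectory and color-geometric priors.

```python
import numpy as np

def toy_denoiser(x: np.ndarray, t: int) -> np.ndarray:
    """Stand-in for the video diffusion model's noise prediction
    (assumed; the real prior is a trained network)."""
    return 0.1 * x

def consistency_grad(x: np.ndarray, anchor: np.ndarray) -> np.ndarray:
    """Gradient of a simple color/geometry consistency penalty pulling
    each frame toward a rendering of the current 3DGS model."""
    return x - anchor

def guided_denoise(frames, anchor, steps=10, guidance=0.05):
    x = frames.copy()
    for t in range(steps, 0, -1):
        eps = toy_denoiser(x, t)                         # diffusion prior
        x = x - eps / steps                              # plain denoising step
        x = x - guidance * consistency_grad(x, anchor)   # consistency guidance
    return x

rng = np.random.default_rng(0)
anchor = rng.random((4, 8, 8, 3))                  # rendered trajectory frames
noisy = anchor + rng.normal(0.0, 0.5, anchor.shape)
refined = guided_denoise(noisy, anchor)
err_before = np.abs(noisy - anchor).mean()
err_after = np.abs(refined - anchor).mean()
```

In this toy setup the guidance term pulls the denoised sequence measurably closer to the anchor renderings, mirroring how the consistency priors keep the refined sequences usable for fine-tuning the 3DGS model.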
Tianyi Gong
Shenzhen Future Network of Intelligence Institute, School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen
Boyan Li
The Hong Kong University of Science and Technology (Guangzhou)
Yifei Zhong
Shenzhen Future Network of Intelligence Institute, School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen
Fangxin Wang
School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, Shenzhen Future Network of Intelligence Institute