SpatialCrafter: Unleashing the Imagination of Video Diffusion Models for Scene Reconstruction from Limited Observations

📅 2025-05-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
3D scene reconstruction from sparse or single-view inputs suffers from severe geometric ambiguity because multi-view geometric constraints are insufficient. Method: We propose a novel approach that integrates generative priors with explicit geometric modeling. Specifically, we pioneer the use of video diffusion models to augment the observations and mitigate the information deficit of a single frame; we design a trainable camera encoder and an epipolar attention mechanism to enforce cross-view geometric consistency; and we construct a joint depth-semantic latent space that unifies scale estimation and 3D Gaussian primitive regression. Contribution/Results: Our method removes the reliance on dense multi-view inputs, achieving high-fidelity, geometrically consistent, and photorealistic 3D reconstructions across multiple benchmarks, and significantly improves reconstruction quality and generalization under sparse-view conditions.

📝 Abstract
Novel view synthesis (NVS) boosts immersive experiences in computer vision and graphics. Despite recent progress, existing techniques rely on dense multi-view observations, which restricts their application. This work takes on the challenge of reconstructing photorealistic 3D scenes from sparse or single-view inputs. We introduce SpatialCrafter, a framework that leverages the rich knowledge in video diffusion models to generate plausible additional observations, thereby alleviating reconstruction ambiguity. Through a trainable camera encoder and an epipolar attention mechanism for explicit geometric constraints, we achieve precise camera control and 3D consistency, further reinforced by a unified scale estimation strategy that handles scale discrepancies across datasets. Furthermore, by integrating monocular depth priors with semantic features in the video latent space, our framework directly regresses 3D Gaussian primitives and efficiently processes long-sequence features with a hybrid network structure. Extensive experiments show that our method enhances sparse-view reconstruction and restores the realistic appearance of 3D scenes.
Problem

Research questions and friction points this paper is trying to address.

Reconstructing 3D scenes from sparse or single-view inputs
Generating plausible observations to reduce reconstruction ambiguity
Ensuring 3D consistency and handling scale discrepancies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages video diffusion models for scene reconstruction
Uses epipolar attention for 3D consistency
Integrates depth priors with semantic features
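The epipolar attention listed above can be understood as an additive bias on attention logits: for each source-view pixel, target-view pixels far from its epipolar line are down-weighted, which softly restricts cross-view attention to geometrically plausible matches. A minimal NumPy sketch, with all function names and the Gaussian falloff chosen here for illustration (this is not the paper's implementation):

```python
import numpy as np

def skew(t):
    # Cross-product matrix [t]_x, so that skew(t) @ v == np.cross(t, v).
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def fundamental_matrix(K, R, t):
    # F maps a homogeneous pixel x in the source view to its
    # epipolar line l = F @ x in the target view.
    Kinv = np.linalg.inv(K)
    return Kinv.T @ skew(t) @ R @ Kinv

def epipolar_bias(F, src_px, tgt_px, sigma=2.0):
    # Additive attention bias: Gaussian penalty (width sigma, in pixels)
    # on the distance from each target pixel to each source pixel's
    # epipolar line. Shape: (num_src, num_tgt).
    src_h = np.hstack([src_px, np.ones((len(src_px), 1))])
    tgt_h = np.hstack([tgt_px, np.ones((len(tgt_px), 1))])
    lines = src_h @ F.T                      # one line (a, b, c) per source pixel
    num = np.abs(lines @ tgt_h.T)            # |a*u + b*v + c|
    den = np.linalg.norm(lines[:, :2], axis=1, keepdims=True) + 1e-8
    d = num / den                            # point-to-line distance in pixels
    return -(d ** 2) / (2.0 * sigma ** 2)

def attention(q, k, v, bias):
    # Standard scaled dot-product attention with the epipolar bias
    # added to the logits before the softmax.
    logits = q @ k.T / np.sqrt(q.shape[-1]) + bias
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

With identity intrinsics and a pure sideways translation, the epipolar lines are horizontal, so a target pixel on the same row as the source pixel receives zero penalty while off-row pixels are suppressed; in a full model the same bias would be broadcast over heads and feature channels.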
Songchun Zhang
The Hong Kong University of Science and Technology
Huiyao Xu
Zhejiang University
Sitong Guo
Zhejiang University
Zhongwei Xie
Wuhan University
Pengwei Liu
Zhejiang University
Hujun Bao
Zhejiang University
Weiwei Xu
Zhejiang University
Changqing Zou
Zhejiang University, Zhejiang Lab