Scene Splatter: Momentum 3D Scene Generation from Single Image with Video Diffusion Model

📅 2025-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing single-image novel-view synthesis methods rely on video diffusion models, which suffer from limited sequence length and poor inter-frame consistency, leading to artifacts and geometric distortions in 3D reconstruction. To address these limitations, we propose an iterative Gaussian optimization framework guided by a cascaded momentum mechanism. Our method jointly incorporates implicit latent-level and explicit pixel-level momentum to enhance cross-view consistency and improve inpainting of unobserved regions. By integrating this multi-level momentum guidance with iterative refinement of global Gaussian representations and rendering-based momentum updates, our approach enables progressive 3D scene reconstruction. Extensive experiments across diverse complex scenes demonstrate significant improvements in novel-view synthesis quality, geometric accuracy, and temporal stability. Moreover, our method supports longer synthesis sequences and higher-fidelity single-image-to-3D reconstruction than prior approaches.
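As a rough sketch of how the two momentum levels could interact within a single denoising step, the snippet below re-noises an anchor latent (rendered from the current Gaussian scene) and blends it into the running sample as latent-level momentum, then composites the guided and unconstrained branches with a visibility mask as pixel-level momentum. Every name, signature, and blending weight here is an illustrative assumption, not the paper's exact formulation.

```python
import torch

def cascaded_momentum_step(denoiser, z_t, z_anchor, t, alpha_bar_t,
                           known_mask, lam_latent=0.5, lam_pixel=0.8):
    # Latent-level momentum: re-noise the clean anchor latent back to
    # timestep t with the standard DDPM forward process q(z_t | z_0),
    # then blend it into the current sample.
    noise = torch.randn_like(z_anchor)
    z_anchor_t = (alpha_bar_t ** 0.5) * z_anchor \
                 + ((1.0 - alpha_bar_t) ** 0.5) * noise
    z_guided = lam_latent * z_anchor_t + (1.0 - lam_latent) * z_t

    # Two denoising branches: one guided by the anchor, one unconstrained.
    eps_guided = denoiser(z_guided, t)
    eps_free = denoiser(z_t, t)

    # Pixel-level momentum: trust the guided branch where the view overlaps
    # observed content; let the free branch hallucinate unseen regions.
    return known_mask * (lam_pixel * eps_guided + (1.0 - lam_pixel) * eps_free) \
           + (1.0 - known_mask) * eps_free
```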

📝 Abstract
In this paper, we propose Scene Splatter, a momentum-based paradigm for video diffusion to generate generic scenes from a single image. Existing methods, which employ video generation models to synthesize novel views, suffer from limited video length and scene inconsistency, leading to artifacts and distortions during further reconstruction. To address this issue, we construct noisy samples from original features as momentum to enhance video details and maintain scene consistency. However, for latent features whose receptive field spans both known and unknown regions, such latent-level momentum restricts the generative ability of video diffusion in unknown regions. Therefore, we further introduce the aforementioned consistent video as pixel-level momentum into a video generated directly without momentum, for better recovery of unseen regions. Our cascaded momentum enables video diffusion models to generate novel views that are both high-fidelity and consistent. We further finetune the global Gaussian representations with the enhanced frames and render new frames for the momentum update in the next step. In this manner, we can iteratively recover a 3D scene, avoiding the limitation of video length. Extensive experiments demonstrate the generalization capability and superior performance of our method in high-fidelity and consistent scene generation.
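The iterative recovery described in the abstract can be pictured as a chunked loop: render the current Gaussians along the next camera segment, use those renders as momentum for video diffusion, finetune the Gaussians on the enhanced frames, and repeat until the trajectory is covered. The sketch below assumes hypothetical `render`, `sample`, and `finetune` helpers; it is an illustration under those assumptions, not the authors' actual interface.

```python
def recover_scene(image, video_diffusion, gaussians, trajectory, chunk_size=25):
    # Walk the camera trajectory in chunks no longer than the diffusion
    # model's native clip length.
    for start in range(0, len(trajectory), chunk_size):
        cams = trajectory[start:start + chunk_size]
        # Render the current global Gaussians along the next chunk; these
        # frames serve as the momentum anchor for generation.
        anchors = [gaussians.render(c) for c in cams]
        # Generate momentum-enhanced, scene-consistent frames for this chunk.
        frames = video_diffusion.sample(image, cams, momentum=anchors)
        # Finetune the global Gaussian representation on the enhanced frames.
        gaussians.finetune(frames, cams)
    return gaussians
```

Because each chunk is anchored to renders of the shared Gaussian scene, the loop sidesteps the fixed clip length of the video model while keeping successive chunks consistent.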
Problem

Research questions and friction points this paper is trying to address.

Generating 3D scenes from single images using video diffusion models
Addressing limited video length and scene inconsistency in view synthesis
Enhancing fidelity and consistency in novel view generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Momentum-based video diffusion for scene generation
Cascaded momentum enhances fidelity and consistency
Iterative 3D scene recovery avoids video length limits
👥 Authors
Shengjun Zhang, Tsinghua University
Jinzhao Li, Tsinghua University
Xin Fei, National University of Singapore
Hao Liu, WeChat Vision, Tencent Inc.
Yueqi Duan, Tsinghua University