Scene Splatter: Momentum 3D Scene Generation from Single Image with Video Diffusion Model

📅 2025-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing single-image novel-view synthesis methods rely on video diffusion models, which suffer from limited sequence length and poor inter-frame consistency, leading to artifacts and geometric distortions in 3D reconstruction. To address these limitations, we propose an iterative Gaussian optimization framework guided by a cascaded momentum mechanism. Our method jointly incorporates implicit latent-level and explicit pixel-level momentum to enhance cross-view consistency and improve inpainting of unobserved regions. By integrating this multi-level momentum guidance with iterative refinement of global Gaussian representations and rendering-based momentum updates, our approach enables progressive 3D scene reconstruction. Extensive experiments across diverse complex scenes demonstrate significant improvements in novel-view synthesis quality, geometric accuracy, and temporal stability. Moreover, our method supports longer synthesis sequences and higher-fidelity single-image-to-3D reconstruction than prior approaches.
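As a rough sketch of how the two momentum levels could interact within a single denoising step, the snippet below re-noises an anchor latent (rendered from the current Gaussian scene) and blends it into the running sample as latent-level momentum, then composites the guided and unconstrained branches with a visibility mask as pixel-level momentum. Every name, signature, and blending weight here is an illustrative assumption, not the paper's exact formulation.

```python
import torch

def cascaded_momentum_step(denoiser, z_t, z_anchor, t, alpha_bar_t,
                           known_mask, lam_latent=0.5, lam_pixel=0.8):
    # Latent-level momentum: re-noise the clean anchor latent back to
    # timestep t with the standard DDPM forward process q(z_t | z_0),
    # then blend it into the current sample.
    noise = torch.randn_like(z_anchor)
    z_anchor_t = (alpha_bar_t ** 0.5) * z_anchor \
                 + ((1.0 - alpha_bar_t) ** 0.5) * noise
    z_guided = lam_latent * z_anchor_t + (1.0 - lam_latent) * z_t

    # Two denoising branches: one guided by the anchor, one unconstrained.
    eps_guided = denoiser(z_guided, t)
    eps_free = denoiser(z_t, t)

    # Pixel-level momentum: trust the guided branch where the view overlaps
    # observed content; let the free branch hallucinate unseen regions.
    return known_mask * (lam_pixel * eps_guided + (1.0 - lam_pixel) * eps_free) \
           + (1.0 - known_mask) * eps_free
```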

📝 Abstract
In this paper, we propose Scene Splatter, a momentum-based paradigm for video diffusion to generate generic scenes from a single image. Existing methods, which employ video generation models to synthesize novel views, suffer from limited video length and scene inconsistency, leading to artifacts and distortions during further reconstruction. To address this issue, we construct noisy samples from original features as momentum to enhance video details and maintain scene consistency. However, for latent features whose receptive field spans both known and unknown regions, such latent-level momentum restricts the generative ability of video diffusion in unknown regions. Therefore, we further introduce the aforementioned consistent video as pixel-level momentum into a video generated directly without momentum, for better recovery of unseen regions. Our cascaded momentum enables video diffusion models to generate novel views that are both high-fidelity and consistent. We further finetune the global Gaussian representations with the enhanced frames and render new frames for the momentum update in the next step. In this manner, we can iteratively recover a 3D scene, avoiding the limitation of video length. Extensive experiments demonstrate the generalization capability and superior performance of our method in high-fidelity and consistent scene generation.
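The iterative recovery described in the abstract can be pictured as a chunked loop: render the current Gaussians along the next camera segment, use those renders as momentum for video diffusion, finetune the Gaussians on the enhanced frames, and repeat until the trajectory is covered. The sketch below assumes hypothetical `render`, `sample`, and `finetune` helpers; it is an illustration under those assumptions, not the authors' actual interface.

```python
def recover_scene(image, video_diffusion, gaussians, trajectory, chunk_size=25):
    # Walk the camera trajectory in chunks no longer than the diffusion
    # model's native clip length.
    for start in range(0, len(trajectory), chunk_size):
        cams = trajectory[start:start + chunk_size]
        # Render the current global Gaussians along the next chunk; these
        # frames serve as the momentum anchor for generation.
        anchors = [gaussians.render(c) for c in cams]
        # Generate momentum-enhanced, scene-consistent frames for this chunk.
        frames = video_diffusion.sample(image, cams, momentum=anchors)
        # Finetune the global Gaussian representation on the enhanced frames.
        gaussians.finetune(frames, cams)
    return gaussians
```

Because each chunk is anchored to renders of the shared Gaussian scene, the loop sidesteps the fixed clip length of the video model while keeping successive chunks consistent.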
Problem

Research questions and friction points this paper is trying to address.

Generating 3D scenes from single images using video diffusion models
Addressing limited video length and scene inconsistency in view synthesis
Enhancing fidelity and consistency in novel view generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Momentum-based video diffusion for scene generation
Cascaded momentum enhances fidelity and consistency
Iterative 3D scene recovery avoids video length limits
👥 Authors
Shengjun Zhang, Tsinghua University
Jinzhao Li, Tsinghua University
Xin Fei, National University of Singapore
Hao Liu, WeChat Vision, Tencent Inc.
Yueqi Duan, Tsinghua University