VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step

📅 2025-04-02

📈 Citations: 0

✨ Influential: 0

career value

209K/year

🤖 AI Summary

3D scene reconstruction from sparse views is ill-posed, leading to geometric distortions and computational inefficiency. Conventional methods rely on strong inter-view overlap, while video diffusion models suffer from slow inference and lack of explicit 3D constraints, resulting in geometrically inconsistent artifacts. This paper proposes the first single-step generative 3D reconstruction framework: leveraging a pre-trained video diffusion model, we introduce three key innovations—3D-aware leap flow distillation, a dynamic timestep decision network, and implicit 3D representation optimization—to enable end-to-end, single-step generation of high-fidelity 3D scenes from sparse video inputs. Our method achieves state-of-the-art reconstruction quality across multiple benchmarks, accelerates inference by 5.2× over prior diffusion-based approaches, and significantly suppresses geometric artifacts. To our knowledge, this is the first diffusion-based 3D generation method that simultaneously ensures strict 3D consistency and high computational efficiency.

Technology Category

Application Category

📝 Abstract

Recovering 3D scenes from sparse views is a challenging task due to its inherent ill-posed problem. Conventional methods have developed specialized solutions (e.g., geometry regularization or feed-forward deterministic model) to mitigate the issue. However, they still suffer from performance degradation by minimal overlap across input views with insufficient visual information. Fortunately, recent video generative models show promise in addressing this challenge as they are capable of generating video clips with plausible 3D structures. Powered by large pretrained video diffusion models, some pioneering research start to explore the potential of video generative prior and create 3D scenes from sparse views. Despite impressive improvements, they are limited by slow inference time and the lack of 3D constraint, leading to inefficiencies and reconstruction artifacts that do not align with real-world geometry structure. In this paper, we propose VideoScene to distill the video diffusion model to generate 3D scenes in one step, aiming to build an efficient and effective tool to bridge the gap from video to 3D. Specifically, we design a 3D-aware leap flow distillation strategy to leap over time-consuming redundant information and train a dynamic denoising policy network to adaptively determine the optimal leap timestep during inference. Extensive experiments demonstrate that our VideoScene achieves faster and superior 3D scene generation results than previous video diffusion models, highlighting its potential as an efficient tool for future video to 3D applications. Project Page: https://hanyang-21.github.io/VideoScene

Problem

Research questions and friction points this paper is trying to address.

Recovering 3D scenes from sparse views efficiently

Overcoming slow inference in video diffusion models

Ensuring 3D constraints align with real-world geometry

Innovation

Methods, ideas, or system contributions that make the work stand out.

Distills video diffusion model for one-step 3D generation

Uses 3D-aware leap flow distillation strategy

Trains dynamic denoising policy network adaptively

🔎 Similar Papers

SV4D: Dynamic 3D Content Generation with Multi-Frame and Multi-View Consistency