VideoFrom3D: 3D Scene Video Generation via Complementary Image and Video Diffusion Models

📅 2025-09-22

🤖 AI Summary

This work targets a known weakness of video diffusion models: for complex 3D scenes they struggle to deliver high visual fidelity, natural motion, and temporal consistency at the same time. The proposed method, VideoFrom3D, synthesizes a scene video from coarse geometry, a camera trajectory, and a reference image, and requires no paired dataset of 3D scene models and natural images. It combines image and video diffusion models in two stages: (i) an image diffusion model, aided by Sparse Appearance-guided Sampling, generates high-quality, cross-view consistent anchor views; (ii) a video diffusion model, enhanced by flow-based camera control and structural guidance, interpolates the intermediate frames between those anchors. Experiments show that the approach produces high-quality, style-consistent scene videos across diverse and challenging scenarios and outperforms simple and extended baselines. The result is a faster route from coarse 3D geometry to finished video content for 3D graphic design workflows.

📝 Abstract

In this paper, we propose VideoFrom3D, a novel framework for synthesizing high-quality 3D scene videos from coarse geometry, a camera trajectory, and a reference image. Our approach streamlines the 3D graphic design workflow, enabling flexible design exploration and rapid production of deliverables. A straightforward approach to synthesizing a video from coarse geometry would be to condition a video diffusion model on the geometric structure. However, existing video diffusion models struggle to generate high-fidelity results for complex scenes because it is difficult to jointly model visual quality, motion, and temporal consistency. To address this, we propose a generative framework that leverages the complementary strengths of image and video diffusion models. Specifically, our framework consists of a Sparse Anchor-view Generation (SAG) module and a Geometry-guided Generative Inbetweening (GGI) module. The SAG module generates high-quality, cross-view consistent anchor views using an image diffusion model, aided by Sparse Appearance-guided Sampling. Building on these anchor views, the GGI module faithfully interpolates intermediate frames using a video diffusion model, enhanced by flow-based camera control and structural guidance. Notably, both modules operate without any paired dataset of 3D scene models and natural images, which is extremely difficult to obtain. Comprehensive experiments show that our method produces high-quality, style-consistent scene videos under diverse and challenging scenarios, outperforming simple and extended baselines.
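The abstract mentions flow-based camera control but does not say how the flow is obtained. One common way to derive such a flow from coarse geometry and a camera trajectory is to reproject each pixel using its depth and the relative camera pose. The sketch below illustrates that general idea for a single pixel under a pinhole camera model. It is an assumption for illustration, not the paper's implementation, and the function name `reprojection_flow` and all of its parameters are hypothetical.

```python
from typing import Sequence, Tuple


def reprojection_flow(
    pixel: Tuple[float, float],
    depth: float,
    focal: Tuple[float, float],
    principal: Tuple[float, float],
    rotation: Sequence[Sequence[float]],
    translation: Sequence[float],
) -> Tuple[float, float]:
    """Return the 2D displacement of one pixel caused by camera motion.

    Hypothetical helper: assumes a pinhole camera, a depth value rendered
    from the coarse geometry, and a relative pose (rotation, translation)
    that maps points from the first camera frame into the second.
    """
    u, v = pixel
    fx, fy = focal
    cx, cy = principal

    # Back-project the pixel into a 3D point in the first camera frame.
    point = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)

    # Move the point into the second camera frame: p' = R p + t.
    xt, yt, zt = (
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    )
    if zt <= 0:
        raise ValueError("point falls behind the second camera")

    # Project into the second image and subtract the original position.
    u2 = fx * xt / zt + cx
    v2 = fy * yt / zt + cy
    return (u2 - u, v2 - v)
```

Applying this per pixel over a rendered depth map yields a dense flow field between two camera poses. How VideoFrom3D actually computes and injects its flow signal should be checked against the paper itself.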

Problem

Research questions and friction points this paper is trying to address.

Generating high-quality 3D scene videos from coarse geometry and camera trajectories

Overcoming the difficulty video diffusion models have in producing high-fidelity results for complex scenes

Creating style-consistent videos without requiring paired datasets of 3D scene models and natural images

Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages complementary image and video diffusion models

Uses sparse anchor-view generation with appearance-guided sampling

Implements geometry-guided generative inbetweening for interpolation
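The two-stage structure listed above can be sketched as a small scheduling skeleton: generate a sparse set of anchor frames first, then fill each gap between consecutive anchors. This is a minimal sketch under stated assumptions. The uniform-stride anchor selection is an assumption (this page does not say how anchors are chosen), and `generate_anchor` and `inbetween` are hypothetical placeholders standing in for the image and video diffusion models.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")


def select_anchor_indices(num_frames: int, stride: int) -> List[int]:
    """Pick every `stride`-th frame plus the last frame (assumed policy)."""
    if num_frames < 1 or stride < 1:
        raise ValueError("num_frames and stride must be positive")
    indices = list(range(0, num_frames, stride))
    if indices[-1] != num_frames - 1:
        indices.append(num_frames - 1)
    return indices


def anchor_segments(indices: Sequence[int]) -> List[Tuple[int, int]]:
    """Pair consecutive anchor indices into (start, end) segments."""
    return list(zip(indices[:-1], indices[1:]))


def synthesize_video(
    num_frames: int,
    stride: int,
    generate_anchor: Callable[[int], Frame],
    inbetween: Callable[[Frame, Frame, int, int], List[Frame]],
) -> List[Frame]:
    """Stage 1: anchors from an image model. Stage 2: gaps from a video model.

    `generate_anchor(i)` and `inbetween(a, b, start, end)` are hypothetical
    callables; `inbetween` must return the end - start - 1 frames that lie
    strictly between the two anchors.
    """
    indices = select_anchor_indices(num_frames, stride)
    anchors = {i: generate_anchor(i) for i in indices}

    frames: List = [None] * num_frames
    for i, anchor in anchors.items():
        frames[i] = anchor

    for start, end in anchor_segments(indices):
        middle = inbetween(anchors[start], anchors[end], start, end)
        if len(middle) != end - start - 1:
            raise ValueError("inbetween returned the wrong number of frames")
        frames[start + 1:end] = middle
    return frames
```

Keeping the anchors fixed and generating only the frames between them reflects the division of labor described above: the image model is responsible for appearance at the anchors, and the video model is responsible for motion and temporal consistency between them.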
