🤖 AI Summary
This work addresses key challenges in text-to-dynamic 4D scene generation (weak spatial understanding, uncontrollable camera viewpoints, and multi-view spatiotemporal inconsistency) by proposing the first training-free, text-driven 4D generation framework. Methodologically, it leverages a pre-trained video diffusion model to generate reference videos; integrates dynamic camera array selection, progressive optical-flow-based registration, and multi-frame joint inpainting; and employs a differentiable dynamic neural renderer to enforce cross-view spatiotemporal consistency. Contributions include: (1) the first zero-shot 4D scene generation framework with arbitrary, controllable camera trajectories; (2) elimination of object-centric assumptions and of reliance on synthetic training data; and (3) high-fidelity, physically plausible motion modeling with consistent dynamics across views on real-world data, significantly advancing scene-level dynamic 4D generation.
📝 Abstract
Recent advances in diffusion models have revolutionized 2D and 3D content creation, yet generating photorealistic dynamic 4D scenes remains a significant challenge. Existing dynamic 4D generation methods typically rely on distilling knowledge from pre-trained 3D generative models, often fine-tuned on synthetic object datasets. Consequently, the resulting scenes tend to be object-centric and lack photorealism. While text-to-video models can generate more realistic scenes with motion, they often struggle with spatial understanding and provide limited control over camera viewpoints during rendering. To address these limitations, we present PaintScene4D, a novel text-to-4D scene generation framework that departs from conventional multi-view generative models in favor of a streamlined architecture that harnesses video generative models trained on diverse real-world datasets. Our method first generates a reference video using a video generation model, and then employs a strategic camera array selection for rendering. We apply a progressive warping and inpainting technique to ensure both spatial and temporal consistency across multiple viewpoints. Finally, we optimize multi-view images using a dynamic renderer, enabling flexible camera control based on user preferences. Adopting a training-free architecture, our PaintScene4D efficiently produces realistic 4D scenes that can be viewed from arbitrary trajectories. The code will be made publicly available. Our project page is at https://paintscene4d.github.io/
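The abstract describes a staged pipeline: generate a reference video, pick a camera array, progressively warp and inpaint into new viewpoints, then optimize a dynamic renderer on the resulting multi-view videos. The sketch below illustrates only the data flow between these stages; every function name, shape, and the naive shift-based "warp/inpaint" are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

# Illustrative stand-ins for the pipeline stages described in the abstract.
# All names and tensor shapes are assumptions for this sketch, not the
# PaintScene4D API.

def generate_reference_video(num_frames=8, h=32, w=32):
    """Placeholder for the pre-trained text-to-video diffusion model."""
    rng = np.random.default_rng(0)
    return rng.random((num_frames, h, w, 3))  # (T, H, W, 3)

def select_camera_array(num_views=4):
    """Placeholder camera array: azimuth angles around the reference view."""
    return np.linspace(0.0, 30.0, num_views)  # degrees

def warp_and_inpaint(frames, angle):
    """Placeholder for progressive flow-based warping plus joint inpainting.
    A horizontal pixel shift crudely stands in for view-dependent disparity;
    the exposed border is filled by edge replication as a naive 'inpaint'."""
    shift = int(angle)
    warped = np.roll(frames, shift, axis=2)
    if shift > 0:
        warped[:, :, :shift] = warped[:, :, shift:shift + 1]
    return warped

def build_multiview_videos():
    """Progressively propagate the reference video to each new viewpoint;
    the output would supervise the differentiable dynamic renderer."""
    ref = generate_reference_video()
    views = {0.0: ref}
    prev = ref
    for angle in select_camera_array()[1:]:
        prev = warp_and_inpaint(prev, angle)  # warp the previous view onward
        views[float(angle)] = prev
    return views

videos = build_multiview_videos()
print(len(videos), videos[0.0].shape)  # 4 views, each (T, H, W, 3)
```

The progressive structure (each viewpoint derived from its neighbor rather than from the reference alone) mirrors the paper's stated goal of keeping spatial and temporal consistency as the camera moves away from the reference trajectory.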