Follow-Your-Creation: Empowering 4D Creation through Video Inpainting

📅 2025-06-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenging problem of inpainting regions that become occluded, due to camera motion or user edits, in monocular video-driven 4D content generation and editing. We propose a novel 4D generation framework formulated as conditional video inpainting, leveraging the strong generative priors of pretrained video inpainting models. Methodologically, we introduce depth-guided point cloud rendering to generate occlusion masks for invisible regions and fuse them with user-provided edit masks to construct composite training data. We further design a self-iterative view-augmentation training strategy and a temporal packaging module to jointly ensure temporal consistency under large motion and multi-view coherence. To our knowledge, this is the first approach to cast 4D generation as a video inpainting task. Experiments demonstrate that our method significantly outperforms state-of-the-art methods on both 4D generation and editing benchmarks, enabling precise prompt-driven editing while delivering high-quality, robust, and multi-view-consistent 4D video output without compromising the base model's original fidelity.
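To make the depth-guided masking step concrete, the sketch below shows one way an invisibility mask could be obtained: unproject a depth map into a point cloud, re-render it under a new camera pose, and mark pixels that receive no points, then fuse the result with a user edit mask. This is a minimal NumPy sketch under stated assumptions, not the paper's released code; the function names, intrinsics `K`, transform `T_new`, and the nearest-pixel splatting rule are illustrative.

```python
# Minimal sketch (assumed, not the paper's code) of depth-guided point cloud
# rendering for invisibility masks: unproject a depth map, re-render the points
# under a new camera pose, and mark pixels that receive no points.
import numpy as np

def invisibility_mask(depth, K, T_new, H, W):
    """depth: (H, W) depth map in the source view; K: (3, 3) intrinsics;
    T_new: (4, 4) source-to-target camera transform."""
    # Back-project every source pixel to a 3D point.
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    pts = (np.linalg.inv(K) @ pix.T) * depth.reshape(1, -1)        # (3, H*W)
    pts_h = np.vstack([pts, np.ones((1, pts.shape[1]))])           # homogeneous coords

    # Transform the points into the new camera frame and project them.
    cam = (T_new @ pts_h)[:3]
    proj = K @ cam
    z = proj[2]
    front = z > 1e-6
    x = np.round(proj[0, front] / z[front]).astype(int)
    y = np.round(proj[1, front] / z[front]).astype(int)

    # Splat the points; pixels that receive none are invisible and must be inpainted.
    hit = np.zeros((H, W), dtype=bool)
    inb = (x >= 0) & (x < W) & (y >= 0) & (y < H)
    hit[y[inb], x[inb]] = True
    return ~hit                                                     # True = missing region

def composite_mask(invis_mask, edit_mask):
    # Fuse the occlusion mask with a user-provided boolean edit mask.
    return invis_mask | edit_mask
```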

📝 Abstract
We introduce Follow-Your-Creation, a novel 4D video creation framework capable of both generating and editing 4D content from a single monocular video input. By leveraging a powerful video inpainting foundation model as a generative prior, we reformulate 4D video creation as a video inpainting task, enabling the model to fill in missing content caused by camera trajectory changes or user edits. To facilitate this, we generate composite masked inpainting video data to effectively fine-tune the model for 4D video generation. Given an input video and its associated camera trajectory, we first perform depth-based point cloud rendering to obtain invisibility masks that indicate the regions that should be completed. Simultaneously, editing masks are introduced to specify user-defined modifications, and these are combined with the invisibility masks to create a composite mask dataset. During training, we randomly sample different types of masks to construct diverse and challenging inpainting scenarios, enhancing the model's generalization and robustness in various 4D editing and generation tasks. To handle temporal consistency under large camera motion, we design a self-iterative tuning strategy that gradually increases the viewing angles during training, where the model is used to generate the next-stage training data after each fine-tuning iteration. Moreover, we introduce a temporal packaging module during inference to enhance generation quality. Our method effectively leverages the prior knowledge of the base model without degrading its original performance, enabling the generation of 4D videos with consistent multi-view coherence. In addition, our approach supports prompt-based content editing, demonstrating strong flexibility and significantly outperforming state-of-the-art methods in both quality and versatility.
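As a rough illustration of the mask sampling described in the abstract, the sketch below randomly chooses whether a training clip is masked by the invisibility mask, the edit mask, or their union. The sampling weights and the function name are assumptions for illustration only.

```python
# Illustrative sketch (assumed, not from the paper) of randomly sampling mask
# types per training clip so the model sees diverse inpainting scenarios.
import random
import numpy as np

def sample_composite_mask(invis_mask: np.ndarray, edit_mask: np.ndarray) -> np.ndarray:
    """Both masks are boolean arrays of shape (T, H, W) for one video clip."""
    mode = random.choices(["invisibility", "edit", "both"], weights=[0.4, 0.3, 0.3])[0]
    if mode == "invisibility":       # occlusions from the camera trajectory only
        return invis_mask
    if mode == "edit":               # user-specified edit regions only
        return edit_mask
    return invis_mask | edit_mask    # composite: occlusion plus edit regions
```

Per the abstract, mixing mask types in this way exposes the model to diverse and challenging inpainting scenarios, which is what improves generalization across 4D editing and generation tasks.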
Problem

Research questions and friction points this paper is trying to address.

How to generate and edit 4D content from a single monocular video input
How to fill regions that become occluded or missing when the camera trajectory changes or the user edits the scene
How to maintain temporal consistency and multi-view coherence under large camera motion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reformulates 4D creation as a conditional video inpainting task, leveraging a pretrained video inpainting model's generative prior
Builds composite masked training data by fusing depth-based invisibility masks with user-provided edit masks
Self-iterative view-augmentation tuning and a temporal packaging module for consistency under large camera motion (see the sketch after this list)
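A high-level sketch of the self-iterative view-augmentation idea follows, assuming hypothetical callables for trajectory rendering, fine-tuning, and generation; the widening angle schedule and helper names are illustrative, not the paper's exact procedure or values.

```python
# Sketch (assumed) of self-iterative tuning: fine-tune on the current viewing-angle
# range, then use the tuned model to synthesize training data for the next, wider range.
from typing import Callable, List

def self_iterative_tuning(
    model,
    source_videos: List,
    angle_schedule: List[float],        # e.g. [15.0, 30.0, 45.0] degrees, hypothetical
    render_with_trajectory: Callable,   # (video, max_angle) -> (masked_video, mask)
    fine_tune: Callable,                # (model, pairs) -> model
    generate: Callable,                 # (model, masked_video, mask) -> completed video
):
    train_videos = list(source_videos)
    for max_angle in angle_schedule:
        # Build masked training pairs at the current, wider viewing-angle range.
        pairs = [render_with_trajectory(v, max_angle) for v in train_videos]
        model = fine_tune(model, pairs)
        # Use the just-tuned model to generate training data for the next stage.
        train_videos = [generate(model, mv, m) for mv, m in pairs]
    return model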