Generative Video Propagation

📅 2024-12-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Addressing the challenge of object-level video editing (insertion, deletion, deformation, and tracking), where spatiotemporal coherence and physical plausibility are difficult to ensure jointly, this paper proposes GenProp. The method encodes the original video with a selective content encoder and propagates edits made to the first frame through an image-to-video diffusion model. To improve geometric fidelity and physical consistency, the authors introduce a region-aware loss and a jointly trained mask prediction head, enabling deformation-aware editing, independent motion for inserted objects, global removal of shadows and reflections, and tracking of objects together with their associated effects. They further design a synthetic data strategy based on instance-level video segmentation datasets to improve generalization. Extensive experiments show that GenProp achieves state-of-the-art performance across diverse video editing and generation tasks, substantially improving object-level spatiotemporal coherence and physical plausibility.

📝 Abstract
Large-scale video generation models have the inherent ability to realistically model natural scenes. In this paper, we demonstrate that through a careful design of a generative video propagation framework, various video tasks can be addressed in a unified way by leveraging the generative power of such models. Specifically, our framework, GenProp, encodes the original video with a selective content encoder and propagates the changes made to the first frame using an image-to-video generation model. We propose a data generation scheme to cover multiple video tasks based on instance-level video segmentation datasets. Our model is trained by incorporating a mask prediction decoder head and optimizing a region-aware loss to aid the encoder to preserve the original content while the generation model propagates the modified region. This novel design opens up new possibilities: In editing scenarios, GenProp allows substantial changes to an object's shape; for insertion, the inserted objects can exhibit independent motion; for removal, GenProp effectively removes effects like shadows and reflections from the whole video; for tracking, GenProp is capable of tracking objects and their associated effects together. Experiment results demonstrate the leading performance of our model in various video tasks, and we further provide in-depth analyses of the proposed framework.
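The abstract describes training with a region-aware loss (which weights the preserved and edited regions differently so the encoder keeps the original content while the generator propagates the change) alongside a mask prediction decoder head. The paper does not give the exact formulation here, so the following is only a minimal illustrative sketch of how such a loss could be written; the function name, the per-region weights, and the BCE supervision of the mask head are all assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def region_aware_loss(pred_noise, true_noise, pred_mask_logits, edit_mask,
                      w_edit=2.0, w_keep=1.0, w_mask=0.5):
    """Illustrative sketch (not the paper's exact loss).

    pred_noise / true_noise: (B, C, T, H, W) diffusion noise prediction and target.
    pred_mask_logits:        (B, 1, T, H, W) logits from the mask prediction head.
    edit_mask:               (B, 1, T, H, W) binary mask of the edited region.
    """
    # Per-pixel denoising error, weighted higher inside the edited region
    # so the generator focuses on propagating the modification, while the
    # preserved region still constrains the encoder to keep original content.
    per_pixel = (pred_noise - true_noise).pow(2)
    m = edit_mask.expand_as(per_pixel)
    weights = w_edit * m + w_keep * (1.0 - m)
    denoise_term = (weights * per_pixel).mean()

    # Assumed supervision for the mask prediction head: binary cross-entropy
    # against the ground-truth edited region.
    mask_term = F.binary_cross_entropy_with_logits(pred_mask_logits, edit_mask)

    return denoise_term + w_mask * mask_term
```

In this sketch the ground-truth edit masks would come from the instance-level video segmentation data the paper uses to synthesize training pairs.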
Problem

Research questions and friction points this paper is trying to address.

Video Editing
Object Manipulation
Video Quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

GenProp
Mask Prediction Decoder
Region-aware Loss Optimization