Puppet-Master: Scaling Interactive Video Generation as a Motion Prior for Part-Level Dynamics

📅 2024-08-08
🏛️ arXiv.org
📈 Citations: 12
Influential citations: 1
🤖 AI Summary
This work addresses the challenge of modeling part-level motion within objects for interactive video generation. Given a single image and a sparse set of drag trajectories, the proposed Puppet-Master synthesizes a video whose part-level motion is faithful to the drags, by fine-tuning a large-scale pre-trained video diffusion model. The contributions are threefold: (1) a new conditioning architecture that effectively injects the drag control; (2) an all-to-first attention mechanism, a drop-in replacement for conventional spatial attention that mitigates the appearance and background artifacts of existing models; and (3) Objaverse-Animation-HQ, a new dataset of curated part-level animation clips, built with an automated strategy that filters out sub-optimal animations and augments the synthetic renderings with meaningful motion trajectories. Evaluated zero-shot on a real-world benchmark, the method outperforms state-of-the-art approaches and generalizes well to real images across diverse categories.

📝 Abstract
We present Puppet-Master, an interactive video generative model that can serve as a motion prior for part-level dynamics. At test time, given a single image and a sparse set of motion trajectories (i.e., drags), Puppet-Master can synthesize a video depicting realistic part-level motion faithful to the given drag interactions. This is achieved by fine-tuning a large-scale pre-trained video diffusion model, for which we propose a new conditioning architecture to inject the dragging control effectively. More importantly, we introduce the all-to-first attention mechanism, a drop-in replacement for the widely adopted spatial attention modules, which significantly improves generation quality by addressing the appearance and background issues in existing models. Unlike other motion-conditioned video generators that are trained on in-the-wild videos and mostly move an entire object, Puppet-Master is learned from Objaverse-Animation-HQ, a new dataset of curated part-level motion clips. We propose a strategy to automatically filter out sub-optimal animations and augment the synthetic renderings with meaningful motion trajectories. Puppet-Master generalizes well to real images across various categories and outperforms existing methods in a zero-shot manner on a real-world benchmark. See our project page for more results: vgg-puppetmaster.github.io.
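
The all-to-first attention named in the abstract replaces the per-frame spatial attention of the video diffusion backbone. Below is a minimal PyTorch sketch of one plausible reading, in which every frame's spatial tokens query the first frame's tokens; the module name, tensor shapes, and use of `nn.MultiheadAttention` are illustrative assumptions, not the paper's implementation (which may, for instance, also let each frame attend to its own tokens).

```python
import torch
import torch.nn as nn

class AllToFirstAttention(nn.Module):
    """Spatial attention in which every frame's queries attend to the
    keys/values of the first frame rather than to the frame's own tokens.
    Hypothetical sketch; not the paper's code."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, dim) -- spatial tokens for each frame
        b, f, n, d = x.shape
        # Tile frame-0 tokens so each frame shares the same keys/values.
        first = x[:, :1].expand(b, f, n, d).reshape(b * f, n, d)
        q = x.reshape(b * f, n, d)            # per-frame queries
        out, _ = self.attn(q, first, first)   # all frames attend to frame 0
        return out.reshape(b, f, n, d)
```

Since frame 0 is the clean conditioning image, routing all queries to it gives every generated frame a direct path to the reference appearance, which is consistent with the abstract's claim that the mechanism addresses appearance and background issues.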
Problem

Research questions and friction points this paper is trying to address.

Generating part-level object motion in videos
Translating user drag inputs into dynamic animations
Overcoming appearance and background artifacts that arise when fine-tuning video generators
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extending a pre-trained image-to-video generator with drag conditioning (see the sketch after this list)
Proposing the all-to-first attention mechanism as a drop-in replacement for spatial attention
Fine-tuning on Objaverse-Animation-HQ, a curated synthetic part-level motion dataset
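
As referenced in the first item above, here is a minimal sketch of the drag conditioning. The abstract only states that a new conditioning architecture injects the dragging control effectively; one common way to inject sparse drags into a diffusion model is to rasterize them into per-frame displacement maps. Everything in the sketch (function name, encoding, normalization) is an assumption for illustration and may differ from the paper's actual architecture.

```python
import torch

def rasterize_drags(drags, num_frames: int, height: int, width: int) -> torch.Tensor:
    """drags: list of (num_frames, 2) tensors, each a pixel trajectory (x, y).
    Returns a (num_frames, 2, H, W) tensor holding the normalized displacement
    from the drag origin at the drag's current pixel, zero elsewhere.
    Hypothetical encoding; not the paper's code."""
    cond = torch.zeros(num_frames, 2, height, width)
    for traj in drags:
        origin = traj[0]  # drag start point in the input image
        for f in range(num_frames):
            x = int(traj[f, 0].clamp(0, width - 1))
            y = int(traj[f, 1].clamp(0, height - 1))
            cond[f, 0, y, x] = (traj[f, 0] - origin[0]) / width   # normalized dx
            cond[f, 1, y, x] = (traj[f, 1] - origin[1]) / height  # normalized dy
    return cond
```

Maps like these could be concatenated with the noisy latents along the channel dimension, so that at every frame the denoiser sees where each drag currently sits and how far it has moved from its origin.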