Motion Prompting: Controlling Video Generation with Motion Trajectories

📅 2024-12-03

🏛️ arXiv.org

📈 Citations: 13

✨ Influential: 1

🤖 AI Summary

Existing video generation models rely heavily on text prompts, which lack precise spatiotemporal control over dynamic motion and complex action composition. To address this, we propose Motion Prompting—a novel conditioning framework that leverages variable-granularity motion trajectories (sparse/dense, object-level/global/temporal) to enable fine-grained control over camera/object motion, image interaction, motion transfer, and editing. Methodologically, we introduce the first trajectory encoder coupled with a spatiotemporal attention fusion architecture, complemented by motion-guided latent-space optimization and a semantic-driven motion prompt expansion mechanism that automatically maps high-level semantics into detailed motion signals. Quantitative evaluations and human studies across multiple tasks demonstrate significant improvements over state-of-the-art baselines. Generated videos exhibit enhanced physical plausibility and emergent behaviors, establishing a new paradigm for interactive video generation in embodied world modeling.

📝 Abstract

Motion control is crucial for generating expressive and compelling video content; however, most existing video generation models rely mainly on text prompts for control, which struggle to capture the nuances of dynamic actions and temporal compositions. To this end, we train a video generation model conditioned on spatio-temporally sparse or dense motion trajectories. In contrast to prior motion conditioning work, this flexible representation can encode any number of trajectories, object-specific or global scene motion, and temporally sparse motion; due to its flexibility we refer to this conditioning as motion prompts. While users may directly specify sparse trajectories, we also show how to translate high-level user requests into detailed, semi-dense motion prompts, a process we term motion prompt expansion. We demonstrate the versatility of our approach through various applications, including camera and object motion control, "interacting" with an image, motion transfer, and image editing. Our results showcase emergent behaviors, such as realistic physics, suggesting the potential of motion prompts for probing video models and interacting with future generative world models. Finally, we evaluate quantitatively, conduct a human study, and demonstrate strong performance. Video results are available on our webpage: https://motion-prompting.github.io/
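
To make the representation described in the abstract more concrete, the following is a minimal sketch of how a motion prompt might be stored: any number of point trajectories, each of which may leave some frames unconstrained. The names `Track`, `MotionPrompt`, `add_track`, and `density` are hypothetical and chosen for illustration; the abstract does not specify the authors' actual data format.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# A position in pixels, or None when the frame carries no constraint.
# (Assumption: the source does not say how missing frames are encoded.)
Point = Optional[Tuple[float, float]]


@dataclass
class Track:
    """One point trajectory: a position (or None) for every frame."""
    points: List[Point]


@dataclass
class MotionPrompt:
    """Hypothetical container for any number of tracks over num_frames frames."""
    num_frames: int
    tracks: List[Track]

    def add_track(self, points: List[Point]) -> None:
        if len(points) != self.num_frames:
            raise ValueError("track length must equal num_frames")
        self.tracks.append(Track(points))

    def density(self) -> float:
        """Fraction of (track, frame) slots that carry a position."""
        total = len(self.tracks) * self.num_frames
        if total == 0:
            return 0.0
        filled = sum(p is not None for t in self.tracks for p in t.points)
        return filled / total


prompt = MotionPrompt(num_frames=4, tracks=[])
# Spatially sparse: a single point moving to the right in every frame.
prompt.add_track([(10.0, 20.0), (12.0, 20.0), (14.0, 20.0), (16.0, 20.0)])
# Temporally sparse: only the first and last frames are constrained.
prompt.add_track([(50.0, 50.0), None, None, (50.0, 40.0)])
```

In this sketch, a dense prompt would simply contain many tracks with few `None` entries, while a sparse one contains a handful of tracks or mostly unconstrained frames.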

Problem

Research questions and friction points this paper is trying to address.

Control video generation using motion trajectories instead of text prompts

Encode flexible motion representations for object-specific or global scene motion

Translate high-level user requests into detailed motion prompts for diverse applications

Innovation

Methods, ideas, or system contributions that make the work stand out.

Video generation model conditioned on motion trajectories

Flexible motion prompts for dynamic actions

Motion prompt expansion for user requests
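
As a rough illustration of motion prompt expansion, the sketch below turns one high-level request (a single drag from one point to another) into a grid of parallel tracks around the start point. This is an assumption made for illustration only: the abstract does not describe how the authors perform the expansion, and `expand_drag` and its parameters are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def expand_drag(start: Point, end: Point, num_frames: int,
                radius: float, step: float) -> List[List[Point]]:
    """Expand one drag into a square grid of linearly interpolated tracks.

    Every track starts at a grid offset from `start` and moves by the same
    displacement as the drag. Hypothetical helper, not the paper's method.
    """
    if num_frames < 2:
        raise ValueError("need at least two frames to describe motion")
    if step <= 0:
        raise ValueError("step must be positive")
    dx, dy = end[0] - start[0], end[1] - start[1]
    n = int(radius // step)  # number of grid steps on each side of the centre
    tracks: List[List[Point]] = []
    for i in range(-n, n + 1):
        for j in range(-n, n + 1):
            ox, oy = start[0] + i * step, start[1] + j * step
            track = [
                (ox + dx * t / (num_frames - 1), oy + dy * t / (num_frames - 1))
                for t in range(num_frames)
            ]
            tracks.append(track)
    return tracks


# One drag of 20 pixels to the right becomes a 5 x 5 grid of tracks.
tracks = expand_drag((100.0, 100.0), (120.0, 100.0),
                     num_frames=5, radius=8.0, step=4.0)
```

A real system would likely restrict the grid to the pixels of the dragged object rather than a fixed square; the square region here only keeps the example short.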

🔎 Similar Papers

Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion

2024-08-01 | arXiv.org | Citations: 4
