MotionCanvas: Cinematic Shot Design with Controllable Image-to-Video Generation

📅 2025-02-06

🤖 AI Summary

This paper addresses the lack of 3D awareness and motion controllability in image-to-video (I2V) generation. To this end, the authors propose a framework that combines classical motion modeling with video diffusion models. Methodologically: (1) camera and object motions are jointly parameterized in a shared 3D scene space and converted into geometrically grounded spatiotemporal motion-conditioning signals; (2) a motion-canvas interface and a new motion-conditioning mechanism let users specify the intended motion intuitively; and (3) the video diffusion model is conditioned only on these 2D, screen-space signals, so no costly 3D annotations or synthetic 3D training data are required. Experiments on real-world images show short generated videos that follow the specified camera trajectories and support a wide range of cinematic shot designs. The authors report that the approach outperforms existing I2V methods in motion controllability, 3D consistency, and visual fidelity.
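The "classical motion modeling" in step (1) can be illustrated with ordinary pinhole-camera geometry. The sketch below is not the authors' code: it assumes a per-pixel depth map for the input image (for example from an off-the-shelf monocular depth estimator), known intrinsics `K`, and a user-designed camera path given as world-to-camera rotations and translations. It lifts every pixel to 3D and reprojects it under each pose, producing screen-space point tracks of the kind a video diffusion model could be conditioned on. All function names are hypothetical.

```python
import numpy as np

def unproject(depth, K):
    """Lift every pixel of an H x W depth map to camera-space 3D points, shape (H*W, 3)."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]                               # row (v) and column (u) indices
    pix = np.stack([u.ravel(), v.ravel(), np.ones(h * w)])  # (3, N) homogeneous pixels
    rays = np.linalg.inv(K) @ pix                           # (3, N) rays with z = 1
    return (rays * depth.ravel()).T                         # scale each ray by its depth

def project(points, K, R, t):
    """Project world points (N, 3) through a world-to-camera pose (R, t) to pixels (N, 2)."""
    cam = points @ R.T + t                                  # world -> camera coordinates
    uvw = cam @ K.T                                         # camera -> homogeneous pixels
    return uvw[:, :2] / uvw[:, 2:3]                         # perspective divide

def camera_tracks(depth, K, poses):
    """Screen-space tracks (T, N, 2) induced by a camera path over a static scene.

    Assumes the input image was taken by a camera at the world origin (R = I, t = 0),
    so the camera-space points of frame 0 double as world points.
    """
    world = unproject(depth, K)
    return np.stack([project(world, K, R, t) for R, t in poses])

# Hypothetical usage: truck the camera right by 0.1 units per frame.
depth = np.full((4, 6), 2.0)                                # flat plane at depth 2
K = np.array([[100.0, 0.0, 3.0], [0.0, 100.0, 2.0], [0.0, 0.0, 1.0]])
poses = [(np.eye(3), np.array([-0.1 * i, 0.0, 0.0])) for i in range(3)]
tracks = camera_tracks(depth, K, poses)
```

In this toy setup every track drifts left by f·Δx/Z = 100 × 0.1 / 2 = 5 pixels per frame, the usual parallax relation; points nearer the camera would move further, which is what makes the resulting signal 3D-aware rather than a flat 2D pan.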

📝 Abstract

This paper presents a method that allows users to design cinematic video shots in the context of image-to-video generation. Shot design, a critical aspect of filmmaking, involves meticulously planning both camera movements and object motions in a scene. However, enabling intuitive shot design in modern image-to-video generation systems presents two main challenges: first, effectively capturing user intentions on the motion design, where both camera movements and scene-space object motions must be specified jointly; and second, representing motion information that can be effectively utilized by a video diffusion model to synthesize the image animations. To address these challenges, we introduce MotionCanvas, a method that integrates user-driven controls into image-to-video (I2V) generation models, allowing users to control both object and camera motions in a scene-aware manner. By connecting insights from classical computer graphics and contemporary video generation techniques, we demonstrate the ability to achieve 3D-aware motion control in I2V synthesis without requiring costly 3D-related training data. MotionCanvas enables users to intuitively depict scene-space motion intentions, and translates them into spatiotemporal motion-conditioning signals for video diffusion models. We demonstrate the effectiveness of our method on a wide range of real-world image content and shot-design scenarios, highlighting its potential to enhance the creative workflows in digital content creation and adapt to various image and video editing applications.
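Because camera movements and scene-space object motions are specified jointly, both can live in one 3D scene and be combined before projection: move the object's points first, then view everything through the shared camera. The sketch below illustrates that idea under stated assumptions and is not MotionCanvas's implementation: the object is selected by a given boolean mask over scene points, its motion is restricted to a rigid per-frame translation, and the helper names are hypothetical. Besides point tracks it returns a per-frame 2D bounding box of the object, another screen-space signal one might condition on.

```python
import numpy as np

def project(points, K, R, t):
    """Project world points (N, 3) through a world-to-camera pose (R, t) to pixels (N, 2)."""
    cam = points @ R.T + t
    uvw = cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def joint_motion_tracks(points, obj_mask, obj_offsets, K, poses):
    """Combine scene-space object motion with camera motion.

    points:      (N, 3) scene points lifted from the input image.
    obj_mask:    (N,) bool, points that belong to the user-selected object.
    obj_offsets: (T, 3) scene-space translation of the object at each frame.
    poses:       T world-to-camera (R, t) pairs describing the camera path.
    Returns tracks (T, N, 2) and object boxes (T, 4) as (x0, y0, x1, y1).
    """
    tracks, boxes = [], []
    for offset, (R, t) in zip(obj_offsets, poses):
        moved = points.copy()
        moved[obj_mask] += offset                 # object motion, in scene space
        uv = project(moved, K, R, t)              # camera motion, applied to every point
        obj_uv = uv[obj_mask]
        tracks.append(uv)
        boxes.append(np.concatenate([obj_uv.min(axis=0), obj_uv.max(axis=0)]))
    return np.stack(tracks), np.stack(boxes)
```

Keeping the object's motion in scene space, rather than drawing it directly in image space, means the same user-specified motion stays geometrically consistent under any camera path; the projection step alone decides how it appears on screen.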

Problem

Research questions and friction points this paper is trying to address.

Enables cinematic shot design in image-to-video generation.

Integrates user-driven controls for object and camera motions.

Achieves 3D-aware motion control without costly 3D data.

Innovation

Methods, ideas, or system contributions that make the work stand out.

User-driven motion control integration

3D-aware motion without 3D data

Spatiotemporal motion-conditioning signal translation
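Sparse screen-space tracks still have to be turned into a dense tensor that a video diffusion model can take as input. The page does not describe how MotionCanvas encodes its conditioning signals, so the following is only one simple, hypothetical option: for every frame, write each point's displacement since frame 0 at its current rounded pixel location, plus a validity channel, and drop points that leave the frame.

```python
import numpy as np

def rasterize_tracks(tracks, height, width):
    """Rasterize tracks (T, N, 2) of (x, y) pixel positions into a (T, H, W, 3) volume.

    Channels are (dx, dy, valid): (dx, dy) is the point's displacement since frame 0,
    stored at the point's current rounded pixel; valid marks pixels that received a point.
    If several points round to the same pixel, only one of them is kept.
    """
    T = tracks.shape[0]
    cond = np.zeros((T, height, width, 3), dtype=np.float32)
    disp = tracks - tracks[0:1]                      # displacement w.r.t. frame 0
    pix = np.rint(tracks).astype(int)                # nearest pixel per point per frame
    for t in range(T):
        x, y = pix[t, :, 0], pix[t, :, 1]
        inside = (x >= 0) & (x < width) & (y >= 0) & (y < height)  # drop off-screen points
        cond[t, y[inside], x[inside], 0:2] = disp[t, inside]
        cond[t, y[inside], x[inside], 2] = 1.0
    return cond
```

The explicit validity channel lets a model distinguish "this point did not move" from "no point here"; denser alternatives such as splatting or learned per-track embeddings would avoid the one-point-per-pixel limitation at extra cost.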
