LMP: Leveraging Motion Prior in Zero-Shot Video Generation with Diffusion Transformer

📅 2025-05-20
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Existing video generation models offer limited fine-grained motion control, particularly for complex action descriptions and for cross-subject image-to-video motion transfer. This paper introduces LMP, a zero-shot text/image-to-video motion control framework that requires no fine-tuning or training: given only a reference motion video, it drives the motion of the target subject in the generated video. The method is built from three co-designed modules on top of a pretrained DiT backbone: foreground-background disentanglement, reweighted motion transfer (spatiotemporal attention-based motion reweighting), and appearance separation (latent-space suppression of the reference subject's appearance). For evaluation, the authors annotate the DAVIS dataset with detailed motion prompts and design corresponding metrics. Experiments show state-of-the-art performance in generation quality, prompt-video alignment, and motion control accuracy, improving controllability for complex actions and cross-subject motion transfer.
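A minimal sketch of how such a training-free pipeline could be wired around a pretrained DiT video model is given below. Everything here is an assumption for illustration: every attribute of the `pipe` object (invert_video, segment_foreground, register_attention_hook, sample) is a hypothetical placeholder API, not the authors' released code.

```python
# Illustrative sketch of a training-free motion-transfer pipeline built around a
# pretrained DiT video model. All `pipe` methods used here are hypothetical
# placeholders, not the authors' code or any released library API.
import torch

@torch.no_grad()
def transfer_motion(pipe, ref_video, prompt, ref_image=None, steps=50):
    # 1) Invert the reference motion video into the model's latent space
    #    (e.g., via DDIM inversion) and cache its attention features so they
    #    can be reused while sampling the target video.
    ref_latents, ref_attn_cache = pipe.invert_video(ref_video, num_steps=steps)

    # 2) Foreground-background disentanglement: estimate a per-frame mask of
    #    the moving subject so background features do not leak into the target.
    fg_mask = pipe.segment_foreground(ref_video)  # (T, H, W), values in [0, 1]

    # 3) Reweighted motion transfer + appearance separation: during sampling,
    #    attention toward the masked reference foreground is amplified (so the
    #    target follows its trajectory), while value features carrying the
    #    reference subject's appearance are suppressed.
    pipe.register_attention_hook(ref_attn_cache, fg_mask,
                                 motion_gain=1.5, appearance_gain=0.0)

    # 4) Standard text-to-video or image-to-video sampling; all control happens
    #    at inference time, so the DiT backbone is never fine-tuned.
    return pipe.sample(prompt=prompt, image=ref_image, num_steps=steps)
```

The essential property is that all three modules act at sampling time, which matches the zero-shot, training-free setting described above.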

📝 Abstract
In recent years, large-scale pre-trained diffusion transformer models have made significant progress in video generation. While current DiT models can produce high-definition, high-frame-rate, and highly diverse videos, there is a lack of fine-grained control over the video content. Controlling the motion of subjects in videos using only prompts is challenging, especially when it comes to describing complex movements. Further, existing methods fail to control the motion in image-to-video generation, as the subject in the reference image often differs from the subject in the reference video in terms of initial position, size, and shape. To address this, we propose the Leveraging Motion Prior (LMP) framework for zero-shot video generation. Our framework harnesses the powerful generative capabilities of pre-trained diffusion transformers to enable motion in the generated videos to reference user-provided motion videos in both text-to-video and image-to-video generation. To this end, we first introduce a foreground-background disentangle module to distinguish between moving subjects and backgrounds in the reference video, preventing interference in the target video generation. A reweighted motion transfer module is designed to allow the target video to reference the motion from the reference video. To avoid interference from the subject in the reference video, we propose an appearance separation module to suppress the appearance of the reference subject in the target video. We annotate the DAVIS dataset with detailed prompts for our experiments and design evaluation metrics to validate the effectiveness of our method. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in generation quality, prompt-video consistency, and control capability. Our homepage is available at https://vpx-ecnu.github.io/LMP-Website/
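The reweighted motion transfer module described in the abstract suggests steering the target video by rescaling spatiotemporal attention rather than by training. Below is a hedged sketch of that idea for a single scaled-dot-product attention call inside a DiT block; the token-level foreground mask, the logit-space reweighting rule, and the gain value are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def reweighted_motion_attention(q, k, v, fg_mask, motion_gain=1.5):
    """Scaled dot-product attention whose weights toward reference-foreground
    tokens are amplified so the generated subject follows the reference motion.

    q, k, v : (B, heads, N, d) spatiotemporal tokens of the target branch,
              with k, v optionally containing tokens injected from the
              inverted reference video.
    fg_mask : (N,) float mask, 1 for tokens on the moving subject of the
              reference video, 0 elsewhere. Illustrative assumption only.
    """
    scale = q.shape[-1] ** -0.5
    logits = (q @ k.transpose(-2, -1)) * scale            # (B, heads, N, N)

    # Reweight in logit space: foreground key tokens receive a positive bias,
    # background tokens are left untouched and therefore cannot interfere.
    logits = logits + torch.log1p(motion_gain * fg_mask)

    attn = logits.softmax(dim=-1)
    return attn @ v
```

Adding the bias before the softmax keeps each attention row properly normalized, which is one simple way to raise the influence of foreground tokens without breaking the attention distribution.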
Problem

Research questions and friction points this paper is trying to address.

Lack of fine-grained control over video content in current DiT models
Difficulty of controlling subject motion with prompts alone, especially for complex movements
Failure of existing methods to control motion in image-to-video generation, since the subject in the reference image differs from the reference-video subject in initial position, size, and shape
Innovation

Methods, ideas, or system contributions that make the work stand out.

Foreground-background disentangle module to separate the moving subject from the background in the reference video
Reweighted motion transfer module that lets the target video follow the reference motion
Appearance separation module to suppress the reference subject's appearance in the target video (see the sketch after this list)
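As referenced in the last item, one way an appearance separation step could be realized is by down-weighting the value features contributed by the reference subject, so its colors and textures do not imprint on the generated video while its layout can still guide motion. The function name, shapes, and scaling rule below are assumptions for illustration, not the paper's implementation.

```python
import torch

def suppress_reference_appearance(v_ref, fg_mask, appearance_gain=0.0):
    """Down-weight value features of reference-foreground tokens so the
    reference subject's appearance is not copied into the target video.

    v_ref   : (B, heads, N, d) value features from the reference branch.
    fg_mask : (N,) float mask, 1 on the moving subject, 0 on background.
    appearance_gain : how much reference appearance to keep (0 = fully suppress).
    All names and the scaling rule are illustrative assumptions.
    """
    keep = 1.0 - fg_mask * (1.0 - appearance_gain)   # 1 on background, gain on subject
    return v_ref * keep[None, None, :, None]
```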
Authors
Changgu Chen, East China Normal University
Xiaoyan Yang, Advanced Digital Sciences Center
Junwei Shu, East China Normal University
Changbo Wang, East China Normal University
Yang Li, East China Normal University