MotionMatcher: Motion Customization of Text-to-Video Diffusion Models via Motion Feature Matching

📅 2025-02-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Text-to-video (T2V) diffusion models struggle with fine-grained motion control and accurate camera motion replication. To address this, we propose a reference-guided video generation framework that performs motion matching in feature space—not pixel space—for the first time, effectively decoupling motion from content and preventing content leakage. Our method leverages pre-trained T2V models to extract spatiotemporal motion features, then applies lightweight fine-tuning via feature-level knowledge distillation and contrastive loss. Experiments demonstrate state-of-the-art performance across multiple motion fidelity metrics, significantly improving motion consistency, temporal coherence, and text-motion alignment accuracy. This work establishes an efficient, high-fidelity paradigm for controllable video generation.

Technology Category

Application Category

📝 Abstract
Text-to-video (T2V) diffusion models have shown promising capabilities in synthesizing realistic videos from input text prompts. However, the input text description alone provides limited control over the precise objects movements and camera framing. In this work, we tackle the motion customization problem, where a reference video is provided as motion guidance. While most existing methods choose to fine-tune pre-trained diffusion models to reconstruct the frame differences of the reference video, we observe that such strategy suffer from content leakage from the reference video, and they cannot capture complex motion accurately. To address this issue, we propose MotionMatcher, a motion customization framework that fine-tunes the pre-trained T2V diffusion model at the feature level. Instead of using pixel-level objectives, MotionMatcher compares high-level, spatio-temporal motion features to fine-tune diffusion models, ensuring precise motion learning. For the sake of memory efficiency and accessibility, we utilize a pre-trained T2V diffusion model, which contains considerable prior knowledge about video motion, to compute these motion features. In our experiments, we demonstrate state-of-the-art motion customization performances, validating the design of our framework.
Problem

Research questions and friction points this paper is trying to address.

Customize motion in text-to-video models
Enhance motion control via feature matching
Address content leakage in video synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Feature-level fine-tuning
Spatio-temporal motion features
Pre-trained T2V model