EfficientMT: Efficient Temporal Adaptation for Motion Transfer in Text-to-Video Diffusion Models

📅 2025-03-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current text-to-video (T2V) models exhibit limited motion controllability, and prevailing motion transfer approaches rely on test-time, sample-specific optimization, entailing high computational overhead. To address this, the authors propose EfficientMT, an end-to-end motion transfer framework that requires no test-time optimization. The method repurposes the T2V backbone to extract temporal information from reference videos, introduces a scaler module to distill motion-related features, and adds a temporal integration mechanism that injects reference motion into the generation process. After fine-tuning on a small set of synthetic paired samples, EfficientMT accurately captures and reproduces diverse motion patterns, outperforming existing methods in efficiency while maintaining flexible motion controllability.

📝 Abstract
Progress on generative models has led to significant advances in text-to-video (T2V) generation, yet the motion controllability of generated videos remains limited. Existing motion transfer methods explore the motion representations of reference videos to guide generation. Nevertheless, these methods typically rely on sample-specific optimization strategies, resulting in high computational burdens. In this paper, we propose EfficientMT, a novel and efficient end-to-end framework for video motion transfer. By leveraging a small set of synthetic paired motion transfer samples, EfficientMT effectively adapts a pretrained T2V model into a general motion transfer framework that can accurately capture and reproduce diverse motion patterns. Specifically, we repurpose the backbone of the T2V model to extract temporal information from reference videos, and further propose a scaler module to distill motion-related information. Subsequently, we introduce a temporal integration mechanism that seamlessly incorporates reference motion features into the video generation process. After training on our self-collected synthetic paired samples, EfficientMT enables general video motion transfer without requiring test-time optimization. Extensive experiments demonstrate that our EfficientMT outperforms existing methods in efficiency while maintaining flexible motion controllability. Our code will be available at https://github.com/PrototypeNx/EfficientMT.
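The pipeline the abstract describes (the T2V backbone extracts temporal features from the reference video, a scaler module distills motion-related information, and a temporal integration step injects it into generation) can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions: the function names, feature shapes, and the sigmoid-gated residual fusion are all hypothetical, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def scaler(ref_feats, gen_feats, w, b):
    """Hypothetical scaler module: predicts a per-element gate in (0, 1)
    from concatenated reference and generation temporal features,
    keeping motion-relevant components and suppressing the rest."""
    x = np.concatenate([ref_feats, gen_feats], axis=-1)  # (frames, 2 * dim)
    logits = x @ w + b                                   # (frames, dim)
    return 1.0 / (1.0 + np.exp(-logits))                 # sigmoid gate

def temporal_integration(gen_feats, ref_feats, gate):
    """Hypothetical temporal integration: gated residual injection of
    reference motion features into the generation branch."""
    return gen_feats + gate * ref_feats

# Toy temporal features: 16 frames, 8 channels (shapes are illustrative).
frames, dim = 16, 8
ref = rng.normal(size=(frames, dim))   # from the repurposed T2V backbone
gen = rng.normal(size=(frames, dim))   # from the generation branch
w = rng.normal(scale=0.1, size=(2 * dim, dim))
b = np.zeros(dim)

g = scaler(ref, gen, w, b)
fused = temporal_integration(gen, ref, g)
print(fused.shape)  # (16, 8)
```

In a real model the gate weights would be learned during fine-tuning on the synthetic paired samples, so that at inference time motion transfer is a single forward pass with no per-sample optimization.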
Problem

Research questions and friction points this paper is trying to address.

Limited motion controllability in text-to-video generation
High computational burden in existing motion transfer methods
Need for efficient general motion transfer framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages synthetic paired motion transfer samples
Repurposes T2V model for temporal information extraction
Introduces temporal integration mechanism for motion features
Yufei Cai
Key Laboratory of Intelligent Information Processing, Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, University of the Chinese Academy of Sciences
Hu Han
Professor, Institute of Computing Technology, Chinese Academy of Sciences
Computer Vision, Pattern Recognition, Biometrics, Medical Vision Intelligence
Yuxiang Wei
Harbin Institute of Technology
Shiguang Shan
Professor of Institute of Computing Technology, Chinese Academy of Sciences
Computer Vision, Pattern Recognition, Machine Learning, Face Recognition
Xilin Chen
Key Laboratory of Intelligent Information Processing, Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, University of the Chinese Academy of Sciences