EfficientMT: Efficient Temporal Adaptation for Motion Transfer in Text-to-Video Diffusion Models

📅 2025-03-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current text-to-video (T2V) models exhibit limited motion controllability, and prevailing motion transfer approaches rely on test-time, sample-specific optimization, entailing high computational overhead. To address this, the authors propose EfficientMT, an end-to-end motion transfer framework that requires no test-time optimization. The method repurposes the T2V backbone to extract temporal information from reference videos, introduces a scaler module to distill motion-related features, and adds a temporal integration mechanism that injects reference motion into the generation process. After fine-tuning on a small set of synthetic paired samples, EfficientMT accurately captures and reproduces diverse motion patterns, outperforming existing methods in efficiency while maintaining flexible motion controllability.

📝 Abstract
Progress on generative models has led to significant advances in text-to-video (T2V) generation, yet the motion controllability of generated videos remains limited. Existing motion transfer methods explore the motion representations of reference videos to guide generation. Nevertheless, these methods typically rely on sample-specific optimization strategies, resulting in high computational burdens. In this paper, we propose EfficientMT, a novel and efficient end-to-end framework for video motion transfer. By leveraging a small set of synthetic paired motion transfer samples, EfficientMT effectively adapts a pretrained T2V model into a general motion transfer framework that can accurately capture and reproduce diverse motion patterns. Specifically, we repurpose the backbone of the T2V model to extract temporal information from reference videos, and further propose a scaler module to distill motion-related information. Subsequently, we introduce a temporal integration mechanism that seamlessly incorporates reference motion features into the video generation process. After training on our self-collected synthetic paired samples, EfficientMT enables general video motion transfer without requiring test-time optimization. Extensive experiments demonstrate that our EfficientMT outperforms existing methods in efficiency while maintaining flexible motion controllability. Our code will be available at https://github.com/PrototypeNx/EfficientMT.
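The pipeline the abstract describes (the T2V backbone extracts temporal features from the reference video, a scaler module distills motion-related information, and a temporal integration step injects it into generation) can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions: the function names, feature shapes, and the sigmoid-gated residual fusion are all hypothetical, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def scaler(ref_feats, gen_feats, w, b):
    """Hypothetical scaler module: predicts a per-element gate in (0, 1)
    from concatenated reference and generation temporal features,
    keeping motion-relevant components and suppressing the rest."""
    x = np.concatenate([ref_feats, gen_feats], axis=-1)  # (frames, 2 * dim)
    logits = x @ w + b                                   # (frames, dim)
    return 1.0 / (1.0 + np.exp(-logits))                 # sigmoid gate

def temporal_integration(gen_feats, ref_feats, gate):
    """Hypothetical temporal integration: gated residual injection of
    reference motion features into the generation branch."""
    return gen_feats + gate * ref_feats

# Toy temporal features: 16 frames, 8 channels (shapes are illustrative).
frames, dim = 16, 8
ref = rng.normal(size=(frames, dim))   # from the repurposed T2V backbone
gen = rng.normal(size=(frames, dim))   # from the generation branch
w = rng.normal(scale=0.1, size=(2 * dim, dim))
b = np.zeros(dim)

g = scaler(ref, gen, w, b)
fused = temporal_integration(gen, ref, g)
print(fused.shape)  # (16, 8)
```

In a real model the gate weights would be learned during fine-tuning on the synthetic paired samples, so that at inference time motion transfer is a single forward pass with no per-sample optimization.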
Problem

Research questions and friction points this paper is trying to address.

Limited motion controllability in text-to-video generation
High computational burden in existing motion transfer methods
Need for efficient general motion transfer framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages synthetic paired motion transfer samples
Repurposes T2V model for temporal information extraction
Introduces temporal integration mechanism for motion features
Yufei Cai
Key Laboratory of Intelligent Information Processing, Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, University of the Chinese Academy of Sciences
Hu Han
Professor, Institute of Computing Technology, Chinese Academy of Sciences
Computer Vision, Pattern Recognition, Biometrics, Medical Vision Intelligence
Yuxiang Wei
Harbin Institute of Technology
Shiguang Shan
Professor of Institute of Computing Technology, Chinese Academy of Sciences
Computer Vision, Pattern Recognition, Machine Learning, Face Recognition
Xilin Chen
Key Laboratory of Intelligent Information Processing, Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, University of the Chinese Academy of Sciences