🤖 AI Summary
Current text-to-video (T2V) models exhibit limited motion controllability, and prevailing motion transfer approaches rely on test-time, sample-level optimization, which entails high computational overhead and poor generalization. To address this, we propose the first end-to-end, test-time-optimization-free universal motion transfer framework. Our method introduces a lightweight scaling module and a spatiotemporal fusion mechanism to enable efficient temporal adaptation of pre-trained T2V models. Additionally, we incorporate diffusion-driven temporal feature reuse, synthetic paired-data fine-tuning, reference-video motion feature distillation, and a dynamic temporal ensemble. This design achieves flexible motion control while substantially reducing computational cost. Extensive experiments demonstrate state-of-the-art performance across multiple motion transfer benchmarks, including zero-shot cross-video motion transfer, highlighting superior generalization and efficiency.
📝 Abstract
Progress in generative models has led to significant advances in text-to-video (T2V) generation, yet the motion controllability of generated videos remains limited. Existing motion transfer methods explore the motion representations of reference videos to guide generation. Nevertheless, these methods typically rely on sample-specific optimization strategies, resulting in high computational burdens. In this paper, we propose **EfficientMT**, a novel and efficient end-to-end framework for video motion transfer. By leveraging a small set of synthetic paired motion transfer samples, EfficientMT effectively adapts a pretrained T2V model into a general motion transfer framework that can accurately capture and reproduce diverse motion patterns. Specifically, we repurpose the backbone of the T2V model to extract temporal information from reference videos, and further propose a scaler module to distill motion-related information. Subsequently, we introduce a temporal integration mechanism that seamlessly incorporates reference motion features into the video generation process. After training on our self-collected synthetic paired samples, EfficientMT enables general video motion transfer without requiring test-time optimization. Extensive experiments demonstrate that EfficientMT outperforms existing methods in efficiency while maintaining flexible motion controllability. Our code will be available at https://github.com/PrototypeNx/EfficientMT.
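The abstract's core idea, a scaler module that gates motion-related channels of reference features, followed by temporal integration of those features into the generation path, can be illustrated with a minimal NumPy sketch. Everything here is hypothetical: the function names `scaler` and `temporal_integrate`, the sigmoid gating, and the concatenation along the temporal axis are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def scaler(ref_feats, w):
    # Hypothetical scaler module: a learned per-channel sigmoid gate that
    # suppresses appearance-related channels and passes motion-related ones.
    gate = 1.0 / (1.0 + np.exp(-w))      # values in (0, 1), shape (C,)
    return ref_feats * gate              # (T, C) gated reference features

def temporal_integrate(gen_feats, ref_feats):
    # Hypothetical temporal integration: concatenate reference motion
    # features along the temporal axis so that the T2V backbone's temporal
    # layers can attend to them during generation.
    return np.concatenate([ref_feats, gen_feats], axis=0)  # (2T, C)

# Toy shapes: T frames, C feature channels.
T, C = 8, 4
ref = np.random.randn(T, C)              # features from the reference video
gen = np.random.randn(T, C)              # features of the video being generated
fused = temporal_integrate(gen, scaler(ref, np.zeros(C)))
print(fused.shape)  # (16, 4)
```

With zero-initialized gate weights the sigmoid outputs 0.5 everywhere, so reference features start half-attenuated and training can learn which channels carry motion; this mirrors the stated goal of distilling motion-related information without test-time optimization.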