🤖 AI Summary
To address motion inconsistency and low fine-tuning efficiency in large video diffusion Transformers, this paper proposes a two-stage fine-tuning framework. Methodologically, it introduces (1) a novel spatiotemporal-decoupled LoRA architecture that explicitly separates spatial appearance modeling from temporal motion modeling; and (2) a sparse motion sampling strategy coupled with adaptive 3D RoPE positional encoding, substantially reducing computational overhead while enhancing temporal modeling capability. Evaluated on the newly constructed MotionBench benchmark—covering diverse motion patterns including camera motion, single- and multi-object dynamics, and complex human actions—the method consistently outperforms existing approaches. It achieves superior trade-offs between motion fidelity and inference efficiency, demonstrating that spatiotemporal decoupling simultaneously improves video generation quality and training scalability.
📝 Abstract
Recently, breakthroughs in video diffusion transformers have shown remarkable capabilities in diverse motion generation. For the motion-transfer task, current methods mainly use two-stage Low-Rank Adaptation (LoRA) fine-tuning to obtain better performance. However, existing adaptation-based motion transfer still suffers from motion inconsistency and tuning inefficiency when applied to large video diffusion transformers. Naive two-stage LoRA tuning struggles to maintain motion consistency between generated and input videos due to the inherent spatial-temporal coupling in the 3D attention operator, and both stages require time-consuming fine-tuning. To tackle these issues, we propose Follow-Your-Motion, an efficient two-stage video motion transfer framework that fine-tunes a powerful video diffusion transformer to synthesize complex motion. Specifically, we propose a spatial-temporal decoupled LoRA that separates the attention architecture into spatial appearance and temporal motion processing. During the second training stage, we design sparse motion sampling and adaptive RoPE to accelerate tuning. To address the lack of a benchmark in this field, we introduce MotionBench, a comprehensive benchmark comprising diverse motion types, including creative camera motion, single-object motion, multiple-object motion, and complex human motion. Extensive evaluations on MotionBench verify the superiority of Follow-Your-Motion.
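The core idea of a spatial-temporal decoupled LoRA can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; the class, shapes, and the `project_q` helper are hypothetical. The point is that one low-rank adapter updates the attention projection used along the spatial token axis (appearance), while a separate adapter serves the temporal axis (motion), so the two are tuned independently rather than entangled in one 3D-attention update:

```python
import numpy as np

rng = np.random.default_rng(0)

class LoRA:
    """Low-rank update dW = scale * (B @ A); B is zero-initialized,
    so the adapter is a no-op before any tuning (standard LoRA setup)."""
    def __init__(self, d_in, d_out, r=4, scale=1.0):
        self.A = rng.normal(0, 0.02, (r, d_in))   # down-projection
        self.B = np.zeros((d_out, r))             # up-projection
        self.scale = scale

    def delta(self):
        return self.scale * (self.B @ self.A)

d = 8                            # channel dimension (toy size)
W_q = rng.normal(size=(d, d))    # frozen base query projection

# Decoupled adapters: one for spatial appearance, one for temporal motion.
spatial_lora = LoRA(d, d)
temporal_lora = LoRA(d, d)

def project_q(x, mode):
    """Frozen projection plus whichever decoupled adapter matches the axis.
    x: (tokens, d); mode selects the spatial or temporal branch."""
    lora = spatial_lora if mode == "spatial" else temporal_lora
    return x @ (W_q + lora.delta()).T

# Video latent: T frames, each with S spatial tokens of d channels.
T, S = 3, 4
video = rng.normal(size=(T, S, d))

# Spatial path: tokens grouped per frame (appearance modeling).
q_spatial = project_q(video.reshape(T * S, d), "spatial")

# Temporal path: tokens grouped per spatial location (motion modeling).
q_temporal = project_q(video.transpose(1, 0, 2).reshape(S * T, d), "temporal")

print(q_spatial.shape, q_temporal.shape)  # (12, 8) (12, 8)
```

In this sketch, freezing `W_q` and training only the branch-specific `A`/`B` matrices means appearance tuning (stage one) and motion tuning (stage two) never overwrite each other's parameters.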
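One plausible reading of pairing sparse motion sampling with an adaptive RoPE can be sketched as follows. The paper's actual adaptive 3D RoPE may differ; this 1-D temporal toy (with hypothetical names like `rope_angles`) only shows why sampled frames should keep their original positional phases rather than being re-indexed from zero:

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Rotary phase angles for 1-D positions (standard RoPE formulation)."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(positions, inv_freq)          # (len(positions), dim // 2)

T_full, stride, dim = 16, 4, 8

# Dense tuning would use frame indices 0..15; sparse motion sampling keeps
# every 4th frame, reducing temporal tokens (and attention cost) by 4x.
sparse_frames = np.arange(0, T_full, stride)      # [0, 4, 8, 12]

# Naive: re-index the sampled frames as 0..3, distorting temporal phases.
naive = rope_angles(np.arange(len(sparse_frames)), dim)

# Adaptive: keep original frame indices, so each sampled token carries the
# rotary phase it would have had in the dense video.
adaptive = rope_angles(sparse_frames, dim)

dense = rope_angles(np.arange(T_full), dim)
print(np.allclose(adaptive, dense[sparse_frames]))  # True
```

Under this assumption, the sparse pass sees exactly the positional geometry of the full-length video, which is what lets the shortened second tuning stage still learn motion consistent with dense inference.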