AI Summary
This work addresses the weak motion controllability and poor text-action alignment of diffusion-based video generation. We propose a training-free video motion transfer method that repurposes Rotary Position Embedding (RoPE), originally designed for sequence modeling in diffusion Transformers, as an explicit, differentiable motion carrier. Specifically, optical flow is extracted from a reference video to obtain motion trajectories, which are then encoded by modulating RoPE's complex-exponential tensors. During denoising, we jointly optimize a trajectory alignment loss and a Fourier-phase regularization term to ensure precise motion injection and spatiotemporal consistency. To our knowledge, this is the first approach to leverage RoPE for explicit, differentiable motion modeling; it effectively eliminates duplicated frames and high-frequency artifacts. Extensive experiments demonstrate state-of-the-art performance across multiple benchmarks, with significant improvements in quantitative metrics. Generated videos exhibit natural motion dynamics, high text fidelity, and strong temporal coherence.
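The core idea above, encoding motion as a phase shift of RoPE's complex exponentials, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; the function names, the toy 1-D positions, and the per-token flow offsets are illustrative assumptions. It only shows the key identity: shifting a token's position by a flow offset multiplies its RoPE entry by an extra complex phase.

```python
import numpy as np

def rope_phases(positions, dim, base=10000.0):
    """Standard RoPE: complex exponentials e^{i * p * theta_k} per position.
    `base` follows the common RoPE convention; values here are toy choices."""
    k = np.arange(dim // 2)
    theta = base ** (-2.0 * k / dim)           # per-channel rotation frequencies
    angles = positions[:, None] * theta[None]  # (num_positions, dim/2)
    return np.exp(1j * angles)                 # complex-exponential RoPE tensor

def warp_rope_with_flow(positions, flow_offsets, dim):
    """Hypothetical sketch: shift each token's position by its optical-flow
    offset before building RoPE, so motion enters as a phase modulation."""
    return rope_phases(positions + flow_offsets, dim)

positions = np.arange(4, dtype=np.float64)  # toy 1-D token positions
flow = np.array([0.0, 0.5, 1.0, -0.5])      # toy per-token flow offsets
plain = rope_phases(positions, dim=8)
warped = warp_rope_with_flow(positions, flow, dim=8)

# Warping by an offset delta multiplies RoPE by the phase e^{i * delta * theta_k}:
theta = 10000.0 ** (-2.0 * np.arange(4) / 8)
delta_phase = np.exp(1j * flow[:, None] * theta[None])
assert np.allclose(warped, plain * delta_phase)
```

Because the warp is a smooth, differentiable function of the flow offsets, the resulting embeddings can be further optimized by gradient descent during denoising, which is what makes RoPE usable as a motion carrier here.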
Abstract
We propose RoPECraft, a training-free video motion transfer method for diffusion transformers that operates solely by modifying their rotary positional embeddings (RoPE). We first extract dense optical flow from a reference video and use the resulting motion offsets to warp RoPE's complex-exponential tensors, effectively encoding motion into the generation process. These embeddings are then further optimized during the denoising time steps via trajectory alignment between the predicted and target velocities using a flow-matching objective. To keep the output faithful to the text prompt and to prevent duplicate generations, we incorporate a regularization term based on the phase components of the reference video's Fourier transform, projecting the phase angles onto a smooth manifold to suppress high-frequency artifacts. Experiments on benchmarks show that RoPECraft outperforms all recently published methods, both qualitatively and quantitatively.
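The Fourier-phase regularizer described above can be sketched as follows. This is a hedged illustration, not the paper's exact formulation: the function name and the `1 - cos` mapping are assumptions standing in for "projecting phase angles onto a smooth manifold", chosen because it gives a smooth, wrap-around-free penalty on phase differences.

```python
import numpy as np

def fourier_phase_loss(pred, ref):
    """Illustrative sketch: compare the phase angles of the two signals'
    Fourier transforms. Mapping each angle difference through 1 - cos keeps
    the penalty smooth and invariant to 2*pi phase wrap-around."""
    phase_pred = np.angle(np.fft.fftn(pred))
    phase_ref = np.angle(np.fft.fftn(ref))
    return np.mean(1.0 - np.cos(phase_pred - phase_ref))

rng = np.random.default_rng(0)
ref = rng.standard_normal((4, 8, 8))    # toy "video": (frames, H, W)
other = rng.standard_normal((4, 8, 8))  # unrelated signal for comparison

assert fourier_phase_loss(ref, ref) < 1e-12  # identical phases -> zero loss
assert fourier_phase_loss(other, ref) > 0.0  # mismatched phases are penalized
```

Because the Fourier phase carries most of a signal's structural layout while being largely insensitive to per-frequency magnitude, a penalty of this kind steers the output's spatiotemporal structure toward the reference without forcing it to copy the reference's appearance.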