🤖 AI Summary
Existing video motion editing methods are largely confined to simple transformations (e.g., translation, scaling) and struggle to transfer complex semantic motions, such as full-body gestures, facial expressions, object dynamics, or camera motion, from a single reference video. This work proposes a semantic-level video motion transfer framework that transfers motion from one reference video to arbitrary target images without requiring spatial alignment. The approach builds on a pre-trained image-to-video diffusion model and introduces three key ideas: (1) motion-textual inversion, which represents motion with optimizable text/image embedding tokens, exploiting the observation that such models draw appearance mainly from the (latent) image input while the embeddings injected via cross-attention predominantly control motion; (2) an inflated motion-text embedding with multiple tokens per frame, yielding high temporal motion granularity; and (3) reuse of the embedding, once optimized on the reference video, across diverse target images and domains without retraining. Extensive experiments demonstrate significant improvements over state-of-the-art methods in motion fidelity, generalizability, and temporal consistency.
📝 Abstract
Recent years have seen a tremendous improvement in the quality of video generation and editing approaches. While several techniques focus on editing appearance, few address motion. Current approaches using text, trajectories, or bounding boxes are limited to simple motions, so we specify motions with a single motion reference video instead. We further propose to use a pre-trained image-to-video model rather than a text-to-video model. This approach allows us to preserve the exact appearance and position of a target object or scene and helps disentangle appearance from motion. Our method, called motion-textual inversion, leverages our observation that image-to-video models extract appearance mainly from the (latent) image input, while the text/image embedding injected via cross-attention predominantly controls motion. We thus represent motion using text/image embedding tokens. By operating on an inflated motion-text embedding containing multiple text/image embedding tokens per frame, we achieve a high temporal motion granularity. Once optimized on the motion reference video, this embedding can be applied to various target images to generate videos with semantically similar motions. Our approach does not require spatial alignment between the motion reference video and target image, generalizes across various domains, and can be applied to various tasks such as full-body and face reenactment, as well as controlling the motion of inanimate objects and the camera. We empirically demonstrate the effectiveness of our method in the semantic video motion transfer task, significantly outperforming existing methods in this context.
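To make the "inflated motion-text embedding" concrete, here is a minimal NumPy sketch of the shape logic: a set of learnable tokens is allocated per frame (rather than one shared text embedding for the whole clip), and each frame's features attend to its own token slice via cross-attention. All names, shapes, and sizes below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    """Simplified single-head cross-attention (no learned projections).

    queries: (L, D) spatial features of one frame
    context: (N, D) embedding tokens injected for that frame
    """
    scores = queries @ context.T / np.sqrt(queries.shape[-1])
    return softmax(scores, axis=-1) @ context

F, N, L, D = 8, 4, 16, 32  # frames, tokens per frame, query positions, channels
rng = np.random.default_rng(0)

# Inflated motion-text embedding: N tokens per frame (F * N tokens total).
# In the actual method these would be optimized on the motion reference video.
motion_text_embedding = rng.normal(size=(F, N, D))

# Stand-in for per-frame latent features of the video being generated.
frame_features = rng.normal(size=(F, L, D))

# Each frame attends only to its own slice of the inflated embedding,
# which is what gives the representation per-frame temporal granularity.
out = np.stack([
    cross_attention(frame_features[f], motion_text_embedding[f])
    for f in range(F)
])
print(out.shape)  # (8, 16, 32)
```

The key design point this illustrates is that motion information enters only through the cross-attention context, so appearance (carried by the image latent in an image-to-video model) stays untouched when the motion embedding is swapped onto a new target image.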