MotionAdapter: Video Motion Transfer via Content-Aware Attention Customization

📅 2026-01-05

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

career value

183K/year

🤖 AI Summary

Existing diffusion-based text-to-video models struggle to disentangle motion from appearance in complex motion transfer and often suffer from insufficient semantic alignment. This work proposes MotionAdapter, a framework that introduces content-aware attention mechanisms into the DiT architecture, achieving explicit motion-appearance disentanglement for the first time. By employing a 3D full-attention module to extract motion fields and integrating DINO feature-guided motion field rearrangement, the method precisely injects motion from a reference video during the denoising process. This approach enables high-fidelity, semantically consistent complex motion transfer and editing—including operations such as scaling—and demonstrates superior performance over state-of-the-art methods in both qualitative and quantitative evaluations.

Technology Category

Application Category

📝 Abstract

Recent advances in diffusion-based text-to-video models, particularly those built on the diffusion transformer architecture, have achieved remarkable progress in generating high-quality and temporally coherent videos. However, transferring complex motions between videos remains challenging. In this work, we present MotionAdapter, a content-aware motion transfer framework that enables robust and semantically aligned motion transfer within DiT-based T2V models. Our key insight is that effective motion transfer requires \romannumeral1) explicit disentanglement of motion from appearance and \romannumeral 2) adaptive customization of motion to target content. MotionAdapter first isolates motion by analyzing cross-frame attention within 3D full-attention modules to extract attention-derived motion fields. To bridge the semantic gap between reference and target videos, we further introduce a DINO-guided motion customization module that rearranges and refines motion fields based on content correspondences. The customized motion field is then used to guide the DiT denoising process, ensuring that the synthesized video inherits the reference motion while preserving target appearance and semantics. Extensive experiments demonstrate that MotionAdapter outperforms state-of-the-art methods in both qualitative and quantitative evaluations. Moreover, MotionAdapter naturally supports complex motion transfer and motion editing tasks such as zooming.

Problem

Research questions and friction points this paper is trying to address.

motion transfer

video generation

diffusion models

content-aware adaptation

semantic alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

motion transfer

content-aware attention

diffusion transformer