MultiMotion: Multi-Subject Video Motion Transfer via Video Diffusion Transformer

📅 2025-12-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address motion entanglement and the lack of object-level control in DiT-based multi-object video motion transfer, this paper proposes Mask-aware Attention Motion Flow (MAMF), which leverages SAM2-generated object masks to disentangle motion features, and introduces RectPC, a higher-order predictor-corrector sampler, for refined motion flow estimation. Integrated deeply within the Video Diffusion Transformer framework, MAMF enables fine-grained, semantically aligned motion transfer across multiple subjects. Evaluated on a newly constructed multi-object motion transfer benchmark, the method significantly improves temporal coherence and spatial accuracy. While preserving the high-fidelity generation capability of DiTs, it achieves scalable, controllable, and precise motion transfer for multiple subjects for the first time. This work establishes a novel paradigm for object-level motion modeling in video diffusion models, advancing both controllability and structural fidelity in generative video synthesis.

📝 Abstract
Multi-object video motion transfer poses significant challenges for Diffusion Transformer (DiT) architectures due to inherent motion entanglement and a lack of object-level control. We present MultiMotion, a novel unified framework that overcomes these limitations. Our core innovation is Mask-aware Attention Motion Flow (AMF), which utilizes SAM2 masks to explicitly disentangle and control motion features for multiple objects within the DiT pipeline. Furthermore, we introduce RectPC, a high-order predictor-corrector solver for efficient and accurate sampling, particularly beneficial for multi-entity generation. To facilitate rigorous evaluation, we construct the first benchmark dataset specifically for DiT-based multi-object motion transfer. MultiMotion demonstrably achieves precise, semantically aligned, and temporally coherent motion transfer for multiple distinct objects while maintaining DiT's high quality and scalability. Code is provided in the supplementary material.
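The abstract's core idea, restricting attention with per-object SAM2 masks so that motion features of different objects do not mix, can be illustrated with a minimal sketch. This is an assumption-laden stand-in, not the paper's implementation: `mask_aware_attention`, its token layout, and the integer mask encoding are all hypothetical, and the real method operates inside a DiT's attention layers.

```python
import numpy as np

def mask_aware_attention(q, k, v, masks):
    """Illustrative mask-aware attention: each query token may only attend to
    key tokens that belong to the same object mask, so per-object motion
    features stay disentangled.

    q, k, v: (T, d) token features; masks: (T,) integer object id per token.
    (Hypothetical helper, not the paper's AMF module.)
    """
    d = q.shape[-1]
    logits = (q @ k.T) / np.sqrt(d)                   # (T, T) attention logits
    same_object = masks[:, None] == masks[None, :]    # True where tokens share a mask
    logits = np.where(same_object, logits, -1e9)      # block cross-object attention
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Two tokens per object; values encode object identity.
q = np.ones((4, 2))
k = np.ones((4, 2))
v = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
masks = np.array([0, 0, 1, 1])
out = mask_aware_attention(q, k, v, masks)
```

With this masking, an object-0 query mixes only object-0 values (`out[0]` is exactly `[1, 0]`), which is the disentanglement property the abstract attributes to using SAM2 masks inside attention.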
Problem

Research questions and friction points this paper is trying to address.

Motion features of multiple objects are entangled in video generation
Individual object motion cannot be precisely controlled during transfer
No benchmark exists for evaluating multi-object motion transfer
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mask-aware Attention Motion Flow disentangles object motion
RectPC solver enables efficient multi-entity sampling
First benchmark dataset for multi-object motion transfer
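RectPC is described only as a high-order predictor-corrector solver for diffusion sampling; its exact update rule is not given here. As a hedged illustration of the predictor-corrector pattern it builds on, the sketch below shows a generic second-order Heun step for a probability-flow ODE dx/dt = f(x, t). The function name and structure are illustrative stand-ins, not the paper's solver.

```python
def heun_pc_step(x, t, dt, f):
    """Generic second-order predictor-corrector (Heun) step for dx/dt = f(x, t).
    Illustrative only: RectPC is a higher-order solver of this family, but this
    is not its actual update rule."""
    d1 = f(x, t)                      # slope at the current point
    x_pred = x + dt * d1              # predictor: explicit Euler step
    d2 = f(x_pred, t + dt)            # slope at the predicted point
    return x + dt * 0.5 * (d1 + d2)   # corrector: average the two slopes

# Example on dx/dt = -x (exact solution: x(t) = exp(-t)).
x1 = heun_pc_step(1.0, 0.0, 0.1, lambda x, t: -x)
```

For dx/dt = -x from x=1 with dt=0.1, the corrected step lands much closer to exp(-0.1) than a plain Euler step would, which is why predictor-corrector schemes reduce the number of sampling steps needed for accurate multi-entity generation.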
Penghui Liu
College of Information Science and Technology, Beijing University of Technology, Beijing, China
Jiangshan Wang
Tsinghua University
computer vision, deep learning
Yutong Shen
Independent Researcher
Shanhui Mo
Independent Researcher
Chenyang Qi
HKUST
Computer Vision
Yue Ma
Bytedance
NLP, Dialogue System, LLM