🤖 AI Summary
To address motion entanglement and the lack of object-level control in DiT-based multi-object video motion transfer, this paper proposes Mask-aware Attention Motion Flow (MAMF), which leverages SAM2-generated object masks to disentangle per-object motion features, and introduces RectPC, a higher-order predictor-corrector sampler for refined motion flow estimation. Integrated deeply within the Video Diffusion Transformer framework, MAMF enables fine-grained, semantically aligned motion transfer across multiple subjects. Evaluated on a newly constructed multi-object motion transfer benchmark, the method significantly improves temporal coherence and spatial accuracy. While preserving the high-fidelity generation capability of DiTs, it achieves, for the first time, scalable, controllable, and precise motion transfer for multiple subjects. This work establishes a novel paradigm for object-level motion modeling in video diffusion models, advancing both controllability and structural fidelity in generative video synthesis.
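The core idea of mask-aware attention can be sketched in a few lines: object masks gate which tokens may attend to each other, so motion features of different objects do not mix. This is only an illustrative sketch under assumed shapes (flattened tokens, boolean per-object masks), not the paper's actual MAMF implementation.

```python
import numpy as np

def mask_aware_attention(q, k, v, obj_masks):
    """Illustrative mask-aware attention (hypothetical, not the paper's code).

    q, k, v:   (N, d) flattened spatio-temporal token features
    obj_masks: (M, N) boolean array, one row per segmented object
    Each token attends only to tokens that share at least one object
    mask with it (plus itself), disentangling per-object motion.
    """
    N, d = q.shape
    scores = q @ k.T / np.sqrt(d)                                # (N, N) logits
    same_obj = (obj_masks.T.astype(float) @ obj_masks.astype(float)) > 0
    same_obj |= np.eye(N, dtype=bool)                            # always allow self-attention
    scores = np.where(same_obj, scores, -np.inf)                 # block cross-object attention
    scores -= scores.max(axis=-1, keepdims=True)                 # numerically stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

With two disjoint object masks, tokens of one object receive a convex combination of that object's values only, so the other object's motion features cannot leak in.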
📝 Abstract
Multi-object video motion transfer poses significant challenges for Diffusion Transformer (DiT) architectures due to inherent motion entanglement and the lack of object-level control. We present MultiMotion, a novel unified framework that overcomes these limitations. Our core innovation is Mask-aware Attention Motion Flow (AMF), which utilizes SAM2 masks to explicitly disentangle and control motion features for multiple objects within the DiT pipeline. Furthermore, we introduce RectPC, a high-order predictor-corrector solver for efficient and accurate sampling, particularly beneficial for multi-entity generation. To facilitate rigorous evaluation, we construct the first benchmark dataset specifically for DiT-based multi-object motion transfer. MultiMotion demonstrably achieves precise, semantically aligned, and temporally coherent motion transfer for multiple distinct objects while maintaining DiT's high generation quality and scalability. Code is provided in the supplementary material.
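The RectPC solver itself is not spelled out in the abstract, but the general predictor-corrector pattern it builds on is standard: an explicit predictor proposes the next state, and a corrector refines it using the derivative at the predicted point. Below is a minimal second-order (Heun-style) sketch for an ODE dx/dt = f(t, x), shown only to illustrate the idea; the function names and step structure are assumptions, not the paper's method.

```python
def pc_step(f, t, x, dt):
    """One generic predictor-corrector step (Heun's method), illustrative only.

    f:  right-hand side of the ODE dx/dt = f(t, x)
    t:  current time, x: current state, dt: step size
    """
    slope_here = f(t, x)
    pred = x + dt * slope_here                       # predictor: explicit Euler
    slope_there = f(t + dt, pred)                    # derivative at predicted point
    return x + dt * 0.5 * (slope_here + slope_there)  # corrector: trapezoidal average
```

For dx/dt = x with x(0) = 1 and dt = 0.1, one step yields 1 + 0.1 + 0.1²/2 = 1.105, matching the second-order Taylor expansion of e^0.1, which is what makes such correctors attractive for accurate sampling with few steps.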