MultiMotion: Multi Subject Video Motion Transfer via Video Diffusion Transformer

📅 2025-12-08
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
To address motion entanglement and lack of object-level control in DiT-based multi-object video motion transfer, this paper proposes Mask-aware Attention Motion Flow (MAMF), which leverages SAM2-generated object masks to disentangle motion features, and introduces RectPC, a higher-order predictor-corrector sampler, for refined motion flow estimation. Integrated deeply within the Video Diffusion Transformer framework, MAMF enables fine-grained, semantically aligned motion transfer across multiple subjects. Evaluated on a newly constructed multi-object motion transfer benchmark, our method significantly improves temporal coherence and spatial accuracy. While preserving the high-fidelity generation capability of DiTs, it achieves, for the first time, scalable and controllable precise motion transfer for multiple characters. This work establishes a novel paradigm for object-level motion modeling in video diffusion models, advancing both controllability and structural fidelity in generative video synthesis.

📝 Abstract
Multi-object video motion transfer poses significant challenges for Diffusion Transformer (DiT) architectures due to inherent motion entanglement and lack of object-level control. We present MultiMotion, a novel unified framework that overcomes these limitations. Our core innovation is Mask-aware Attention Motion Flow (AMF), which utilizes SAM2 masks to explicitly disentangle and control motion features for multiple objects within the DiT pipeline. Furthermore, we introduce RectPC, a high-order predictor-corrector solver for efficient and accurate sampling, particularly beneficial for multi-entity generation. To facilitate rigorous evaluation, we construct the first benchmark dataset specifically for DiT-based multi-object motion transfer. MultiMotion demonstrably achieves precise, semantically aligned, and temporally coherent motion transfer for multiple distinct objects, maintaining DiT's high quality and scalability. The code is provided in the supplementary material.
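The abstract does not spell out how object masks enter the attention-based motion flow, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that motion between two frames is read off a cross-frame attention map as a soft displacement, and that a per-object mask simply blocks attention to patches outside that object. The function name `masked_attention_flow`, its parameters, and the toy data are all hypothetical.

```python
import numpy as np

def masked_attention_flow(q, k, coords, src_mask, dst_mask):
    """Soft displacement field for ONE object between two frames (illustrative only).

    q, k     : (N, D) patch features of the source / target frame
    coords   : (N, 2) patch-center coordinates shared by both frames
    src_mask : (N,) bool, patches covered by the object in the source frame
    dst_mask : (N,) bool, patches covered by the object in the target frame
    Returns an (N, 2) array; rows outside src_mask are zero.
    """
    if not dst_mask.any():
        # Object absent from the target frame: no motion can be estimated.
        return np.zeros_like(coords, dtype=float)
    d = q.shape[1]
    logits = (q @ k.T) / np.sqrt(d)
    # Assumption: the mask removes target patches that belong to other objects
    # or to the background, so their motion cannot leak into this object.
    logits = np.where(dst_mask[None, :], logits, -np.inf)
    logits = logits - logits.max(axis=1, keepdims=True)  # stable softmax
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    expected = weights @ coords           # attention-weighted target position
    flow = expected - coords              # displacement per source patch
    flow[~src_mask] = 0.0                 # keep only this object's patches
    return flow

# Toy example: four patches on a line; the object sits on patch 0 in the
# source frame and on patches 2-3 in the target frame.
q = 10.0 * np.eye(4)
k = np.roll(q, 2, axis=0)                 # k[2] equals q[0]
coords = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
src_mask = np.array([True, False, False, False])
dst_mask = np.array([False, False, True, True])
flow = masked_attention_flow(q, k, coords, src_mask, dst_mask)
print(flow[0])                            # approximately [2. 0.]
```

Running the function once per object mask would give one displacement field per object, which is the sense in which masks can keep the motion of different subjects separate.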
Problem

Research questions and friction points this paper is trying to address.

Disentangles motion features for multiple objects in videos
Enables precise control over individual object motion transfer
Addresses lack of benchmark for multi-object motion evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mask-aware Attention Motion Flow disentangles object motion
RectPC solver enables efficient multi-entity sampling
First benchmark dataset for multi-object motion transfer
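
RectPC is described only as a high-order predictor-corrector solver, and its exact update rule is not given here. As a generic illustration of the predictor-corrector idea, and not of RectPC itself, the sketch below uses Heun's second-order method on a flow ODE dx/dt = v(x, t): an Euler step predicts the next state, and a trapezoidal average of the two velocities corrects it. The function names and the toy velocity field v(x, t) = -x are assumptions chosen so the exact solution is known.

```python
import numpy as np

def euler_sample(velocity, x0, n_steps=10, t0=0.0, t1=1.0):
    """First-order baseline: plain Euler integration of dx/dt = velocity(x, t)."""
    x = np.asarray(x0, dtype=float)
    h = (t1 - t0) / n_steps
    for i in range(n_steps):
        x = x + h * velocity(x, t0 + i * h)
    return x

def heun_sample(velocity, x0, n_steps=10, t0=0.0, t1=1.0):
    """Heun's method, a second-order predictor-corrector scheme."""
    x = np.asarray(x0, dtype=float)
    h = (t1 - t0) / n_steps
    for i in range(n_steps):
        t = t0 + i * h
        v_start = velocity(x, t)
        x_pred = x + h * v_start                 # predictor: Euler step
        v_end = velocity(x_pred, t + h)          # velocity at the predicted state
        x = x + 0.5 * h * (v_start + v_end)      # corrector: trapezoidal average
    return x

# Toy velocity field (hypothetical): v(x, t) = -x, so x(1) = x0 * exp(-1).
def toy_velocity(x, t):
    return -x

x0 = np.array([1.0, -2.0])
exact = x0 * np.exp(-1.0)
err_euler = np.max(np.abs(euler_sample(toy_velocity, x0) - exact))
err_heun = np.max(np.abs(heun_sample(toy_velocity, x0) - exact))
print(err_euler, err_heun)
```

With the same number of steps, the corrected update lands much closer to the exact solution than plain Euler, at the cost of a second velocity evaluation per step. That trade-off is the usual motivation for predictor-corrector samplers.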
🔎 Similar Papers