🤖 AI Summary
To address motion entanglement and the lack of object-level control in DiT-based multi-object video motion transfer, this paper proposes Mask-aware Attention Motion Flow (MAMF), which leverages SAM2-generated object masks to disentangle per-object motion features, and introduces RectPC, a higher-order predictor-corrector sampler for refined motion flow estimation. Integrated deeply within the Video Diffusion Transformer framework, MAMF enables fine-grained, semantically aligned motion transfer across multiple subjects. Evaluated on a newly constructed multi-object motion transfer benchmark, the method significantly improves temporal coherence and spatial accuracy. While preserving the high-fidelity generation capability of DiTs, it achieves, for the first time, scalable, controllable, and precise motion transfer for multiple subjects. This work establishes a novel paradigm for object-level motion modeling in video diffusion models, advancing both controllability and structural fidelity in generative video synthesis.
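The core idea of mask-aware attention can be sketched in a few lines: object masks gate which tokens may attend to each other, so motion features of different objects do not mix. This is only an illustrative sketch under assumed shapes (flattened tokens, boolean per-object masks), not the paper's actual MAMF implementation.

```python
import numpy as np

def mask_aware_attention(q, k, v, obj_masks):
    """Illustrative mask-aware attention (hypothetical, not the paper's code).

    q, k, v:   (N, d) flattened spatio-temporal token features
    obj_masks: (M, N) boolean array, one row per segmented object
    Each token attends only to tokens that share at least one object
    mask with it (plus itself), disentangling per-object motion.
    """
    N, d = q.shape
    scores = q @ k.T / np.sqrt(d)                                # (N, N) logits
    same_obj = (obj_masks.T.astype(float) @ obj_masks.astype(float)) > 0
    same_obj |= np.eye(N, dtype=bool)                            # always allow self-attention
    scores = np.where(same_obj, scores, -np.inf)                 # block cross-object attention
    scores -= scores.max(axis=-1, keepdims=True)                 # numerically stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

With two disjoint object masks, tokens of one object receive a convex combination of that object's values only, so the other object's motion features cannot leak in.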
📝 Abstract
Multi-object video motion transfer poses significant challenges for Diffusion Transformer (DiT) architectures due to inherent motion entanglement and the lack of object-level control. We present MultiMotion, a novel unified framework that overcomes these limitations. Our core innovation is Mask-aware Attention Motion Flow (AMF), which utilizes SAM2 masks to explicitly disentangle and control motion features for multiple objects within the DiT pipeline. Furthermore, we introduce RectPC, a high-order predictor-corrector solver for efficient and accurate sampling, particularly beneficial for multi-entity generation. To facilitate rigorous evaluation, we construct the first benchmark dataset specifically for DiT-based multi-object motion transfer. MultiMotion demonstrably achieves precise, semantically aligned, and temporally coherent motion transfer for multiple distinct objects while maintaining DiT's high generation quality and scalability. Code is provided in the supplementary material.
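The RectPC solver itself is not spelled out in the abstract, but the general predictor-corrector pattern it builds on is standard: an explicit predictor proposes the next state, and a corrector refines it using the derivative at the predicted point. Below is a minimal second-order (Heun-style) sketch for an ODE dx/dt = f(t, x), shown only to illustrate the idea; the function names and step structure are assumptions, not the paper's method.

```python
def pc_step(f, t, x, dt):
    """One generic predictor-corrector step (Heun's method), illustrative only.

    f:  right-hand side of the ODE dx/dt = f(t, x)
    t:  current time, x: current state, dt: step size
    """
    slope_here = f(t, x)
    pred = x + dt * slope_here                       # predictor: explicit Euler
    slope_there = f(t + dt, pred)                    # derivative at predicted point
    return x + dt * 0.5 * (slope_here + slope_there)  # corrector: trapezoidal average
```

For dx/dt = x with x(0) = 1 and dt = 0.1, one step yields 1 + 0.1 + 0.1²/2 = 1.105, matching the second-order Taylor expansion of e^0.1, which is what makes such correctors attractive for accurate sampling with few steps.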