🤖 AI Summary
Existing image-to-video motion transfer methods struggle to assign distinct motions to multiple objects within a single image, often resulting in motion entanglement. This work proposes FlexiMMT, the first framework to enable explicit multi-object, multi-motion transfer. By integrating object-specific mask constraints into diffusion models through a motion-decoupled mask attention mechanism and a differentiated mask propagation strategy, FlexiMMT effectively disentangles motions across objects. Furthermore, it leverages an attention-derived per-frame mask generation technique to support flexible pairing of arbitrary motions with individual objects. The method achieves precise, composable motion transfer and state-of-the-art performance on multi-object, multi-motion transfer tasks, significantly mitigating cross-object motion interference.
📝 Abstract
Motion transfer has emerged as a promising direction for controllable video generation, yet existing methods largely focus on single-object scenarios and struggle when multiple objects require distinct motion patterns. In this work, we present FlexiMMT, the first implicit image-to-video (I2V) motion transfer framework that explicitly enables multi-object, multi-motion transfer. Given a static multi-object image and multiple reference videos, FlexiMMT independently extracts motion representations and accurately assigns them to different objects, supporting flexible recombination and arbitrary motion-to-object mappings. To address the core challenge of cross-object motion entanglement, we introduce a Motion Decoupled Mask Attention Mechanism that uses object-specific masks to constrain attention, ensuring that motion and text tokens influence only their designated regions. We further propose a Differentiated Mask Propagation Mechanism that derives object-specific masks directly from diffusion attention and propagates them across frames progressively and efficiently. Extensive experiments demonstrate that FlexiMMT achieves precise, compositional motion transfer and state-of-the-art performance in I2V-based multi-object, multi-motion transfer.
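The idea of constraining attention with object-specific masks can be illustrated with a small sketch. This is not the authors' implementation: the function names (`build_allow_mask`, `masked_softmax`, `masked_attention`), the notion of "shared" tokens visible to every pixel, and the handling of pixels with no permitted token are all assumptions made for illustration. It only shows the general pattern in which a conditioning token assigned to one object cannot influence pixels covered by another object's mask.

```python
import math

def build_allow_mask(pixel_owner, token_owner):
    """allow[i][j] is True when query pixel i may attend to conditioning token j.

    pixel_owner[i]: id of the object whose mask covers pixel i, or None (background).
    token_owner[j]: id of the object the token is assigned to, or None (shared token).
    Both encodings are hypothetical, chosen only for this sketch.
    """
    return [[t is None or t == p for t in token_owner] for p in pixel_owner]

def masked_softmax(scores, allow):
    """Row-wise softmax over permitted entries; blocked entries get weight 0."""
    weights = []
    for row, ok in zip(scores, allow):
        kept = [s for s, a in zip(row, ok) if a]
        if not kept:
            # Assumption: a pixel with no permitted token receives no conditioning.
            weights.append([0.0] * len(row))
            continue
        m = max(kept)  # subtract the max for numerical stability
        exps = [math.exp(s - m) if a else 0.0 for s, a in zip(row, ok)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

def masked_attention(q, k, v, allow):
    """Scaled dot-product attention restricted by a boolean allow mask."""
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = [[scale * sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]
    w = masked_softmax(scores, allow)
    dim = len(v[0])
    out = [[sum(wij * vj[d] for wij, vj in zip(wi, v)) for d in range(dim)] for wi in w]
    return out, w

# Toy setup: four pixels (two on object 0, one on object 1, one background) and
# three conditioning tokens (motion token for object 0, motion token for object 1,
# and one shared token).
pixel_owner = [0, 0, 1, None]
token_owner = [0, 1, None]
allow = build_allow_mask(pixel_owner, token_owner)

q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
k = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out, weights = masked_attention(q, k, v, allow)
```

In this toy setup, the pixels on object 0 place zero weight on the motion token of object 1 and vice versa, while the background pixel attends only to the shared token. How the actual method treats background regions and text tokens is not specified in the abstract, so those choices here are placeholders.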