🤖 AI Summary
Existing text-to-video methods primarily focus on single-subject personalization and struggle to jointly customize multiple subjects' identities together with their interactive motions. To address this, we propose the first generative framework enabling customization along both dimensions: multiple subjects and their interactive motions. Our approach leverages user-uploaded images to define subject appearances and user-uploaded videos to extract interaction motions; it employs appearance-agnostic motion learning and a spatial-temporal composition scheme to disentangle motion from appearance and to precisely control inter-subject interactions. We further introduce subject-specific and motion-specific LoRA adapters, combined under a spatial-temporal guided diffusion sampling schedule. Both qualitative and quantitative experiments demonstrate that our method significantly outperforms state-of-the-art approaches on multi-subject video generation, producing videos with high identity fidelity, natural and temporally coherent motion, and physically plausible interactions.
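As a rough illustration of the dual-adapter idea, the sketch below attaches two independent low-rank residuals, one for subject identity and one for motion, to a frozen linear projection such as an attention layer in a video diffusion backbone. The class names (`LoRALayer`, `DualLoRALinear`) and hyperparameters (`rank`, `alpha`) are illustrative assumptions and not VideoMage's actual implementation.

```python
# Minimal sketch (not the authors' code): dual LoRA adapters on one frozen projection.
import torch
import torch.nn as nn


class LoRALayer(nn.Module):
    """Low-rank residual (alpha / rank) * up(down(x)) added to a frozen layer's output."""

    def __init__(self, in_features: int, out_features: int, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.down = nn.Linear(in_features, rank, bias=False)
        self.up = nn.Linear(rank, out_features, bias=False)
        self.scale = alpha / rank
        nn.init.normal_(self.down.weight, std=1.0 / rank)
        nn.init.zeros_(self.up.weight)  # residual starts at zero, so the base model is unchanged

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x)) * self.scale


class DualLoRALinear(nn.Module):
    """Frozen base projection plus separate subject-specific and motion-specific LoRA residuals."""

    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # pretrained weights stay frozen; only the adapters train
        self.subject_lora = LoRALayer(base.in_features, base.out_features, rank)
        self.motion_lora = LoRALayer(base.in_features, base.out_features, rank)

    def forward(self, x: torch.Tensor, use_subject: bool = True, use_motion: bool = True):
        out = self.base(x)
        if use_subject:
            out = out + self.subject_lora(x)  # identity-specific residual
        if use_motion:
            out = out + self.motion_lora(x)   # motion-specific residual
        return out


# Toy usage: wrap a 320-dim attention projection and run a dummy token sequence through it.
proj = DualLoRALinear(nn.Linear(320, 320), rank=4)
hidden = torch.randn(2, 77, 320)
print(proj(hidden).shape)  # torch.Size([2, 77, 320])
```

Training the two adapters on separate data (reference images for the subject LoRA, reference videos for the motion LoRA) is what allows them to be toggled or combined independently at sampling time.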
📝 Abstract
Customized text-to-video generation aims to produce high-quality videos that incorporate user-specified subject identities or motion patterns. However, existing methods mainly focus on personalizing a single concept, either subject identity or motion pattern, limiting their effectiveness for multiple subjects with the desired motion patterns. To tackle this challenge, we propose VideoMage, a unified framework for video customization over both multiple subjects and their interactive motions. VideoMage employs subject and motion LoRAs to capture personalized content from user-provided images and videos, along with an appearance-agnostic motion learning approach to disentangle motion patterns from visual appearance. Furthermore, we develop a spatial-temporal composition scheme to guide interactions among subjects within the desired motion patterns. Extensive experiments demonstrate that VideoMage outperforms existing methods, generating coherent, user-controlled videos with consistent subject identities and interactions.
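To make the composition idea more concrete, the following is a hedged sketch of one way per-subject noise predictions could be merged spatially while a motion-LoRA prediction is blended across frames during diffusion sampling. The function name, mask layout, and blending weight are assumptions for illustration only and may not match the paper's actual spatial-temporal composition scheme.

```python
# Hedged sketch: compose per-subject and motion noise predictions over video latents.
import torch


def composed_noise_prediction(
    latents: torch.Tensor,              # (B, C, T, H, W) video latents
    subject_preds: list,                # per-subject noise predictions, same shape as latents
    subject_masks: list,                # (B, 1, T, H, W) binary region masks per subject
    motion_pred: torch.Tensor,          # noise prediction from the motion LoRA
    motion_weight: float = 0.5,         # illustrative blending weight, not from the paper
) -> torch.Tensor:
    """Paste subject-specific predictions into their regions, then blend in the motion prediction."""
    spatial = torch.zeros_like(latents)
    coverage = torch.zeros_like(subject_masks[0])
    for pred, mask in zip(subject_preds, subject_masks):
        spatial = spatial + pred * mask           # each subject controls its own spatial region
        coverage = coverage + mask
    background = (coverage == 0).float()
    spatial = spatial + motion_pred * background  # outside all regions, fall back to the motion prediction
    # Temporal blending: mix the motion-LoRA prediction in everywhere so the
    # interaction dynamics stay consistent across frames.
    return (1 - motion_weight) * spatial + motion_weight * motion_pred


# Toy shapes only, to show the call signature.
B, C, T, H, W = 1, 4, 8, 32, 32
latents = torch.randn(B, C, T, H, W)
preds = [torch.randn_like(latents), torch.randn_like(latents)]
masks = [torch.zeros(B, 1, T, H, W), torch.zeros(B, 1, T, H, W)]
masks[0][..., :, :16] = 1.0   # left half of the frame for subject 1
masks[1][..., :, 16:] = 1.0   # right half of the frame for subject 2
eps = composed_noise_prediction(latents, preds, masks, torch.randn_like(latents))
print(eps.shape)  # torch.Size([1, 4, 8, 32, 32])
```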