🤖 AI Summary
Existing text-to-video methods primarily focus on single-subject personalization and struggle to jointly customize multiple subjects' identities together with their interactive motions. To address this, we propose the first generative framework enabling customization along both dimensions: multiple subjects and their interactive motions. Our approach leverages user-uploaded images to define subject appearances and user-uploaded videos to extract interaction motions; it employs appearance-agnostic motion learning and a spatial-temporal composition scheme to disentangle motion from appearance and to precisely control inter-subject interactions. We further introduce subject-specific and motion-specific LoRA adapters, combined under a spatial-temporal guided diffusion sampling schedule. Both qualitative and quantitative experiments demonstrate that our method significantly outperforms state-of-the-art approaches on multi-subject video generation, producing videos with high identity fidelity, natural and temporally coherent motion, and physically plausible interactions.
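As a rough illustration of the dual-adapter idea, the sketch below attaches two independent low-rank residuals, one for subject identity and one for motion, to a frozen linear projection such as an attention layer in a video diffusion backbone. The class names (`LoRALayer`, `DualLoRALinear`) and hyperparameters (`rank`, `alpha`) are illustrative assumptions and not VideoMage's actual implementation.

```python
# Minimal sketch (not the authors' code): dual LoRA adapters on one frozen projection.
import torch
import torch.nn as nn


class LoRALayer(nn.Module):
    """Low-rank residual (alpha / rank) * up(down(x)) added to a frozen layer's output."""

    def __init__(self, in_features: int, out_features: int, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.down = nn.Linear(in_features, rank, bias=False)
        self.up = nn.Linear(rank, out_features, bias=False)
        self.scale = alpha / rank
        nn.init.normal_(self.down.weight, std=1.0 / rank)
        nn.init.zeros_(self.up.weight)  # residual starts at zero, so the base model is unchanged

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x)) * self.scale


class DualLoRALinear(nn.Module):
    """Frozen base projection plus separate subject-specific and motion-specific LoRA residuals."""

    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # pretrained weights stay frozen; only the adapters train
        self.subject_lora = LoRALayer(base.in_features, base.out_features, rank)
        self.motion_lora = LoRALayer(base.in_features, base.out_features, rank)

    def forward(self, x: torch.Tensor, use_subject: bool = True, use_motion: bool = True):
        out = self.base(x)
        if use_subject:
            out = out + self.subject_lora(x)  # identity-specific residual
        if use_motion:
            out = out + self.motion_lora(x)   # motion-specific residual
        return out


# Toy usage: wrap a 320-dim attention projection and run a dummy token sequence through it.
proj = DualLoRALinear(nn.Linear(320, 320), rank=4)
hidden = torch.randn(2, 77, 320)
print(proj(hidden).shape)  # torch.Size([2, 77, 320])
```

Training the two adapters on separate data (reference images for the subject LoRA, reference videos for the motion LoRA) is what allows them to be toggled or combined independently at sampling time.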
📝 Abstract
Customized text-to-video generation aims to produce high-quality videos that incorporate user-specified subject identities or motion patterns. However, existing methods mainly focus on personalizing a single concept, either subject identity or motion pattern, limiting their effectiveness for multiple subjects with the desired motion patterns. To tackle this challenge, we propose VideoMage, a unified framework for video customization over both multiple subjects and their interactive motions. VideoMage employs subject and motion LoRAs to capture personalized content from user-provided images and videos, along with an appearance-agnostic motion learning approach to disentangle motion patterns from visual appearance. Furthermore, we develop a spatial-temporal composition scheme to guide interactions among subjects within the desired motion patterns. Extensive experiments demonstrate that VideoMage outperforms existing methods, generating coherent, user-controlled videos with consistent subject identities and interactions.
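To make the composition idea more concrete, the following is a hedged sketch of one way per-subject noise predictions could be merged spatially while a motion-LoRA prediction is blended across frames during diffusion sampling. The function name, mask layout, and blending weight are assumptions for illustration only and may not match the paper's actual spatial-temporal composition scheme.

```python
# Hedged sketch: compose per-subject and motion noise predictions over video latents.
import torch


def composed_noise_prediction(
    latents: torch.Tensor,              # (B, C, T, H, W) video latents
    subject_preds: list,                # per-subject noise predictions, same shape as latents
    subject_masks: list,                # (B, 1, T, H, W) binary region masks per subject
    motion_pred: torch.Tensor,          # noise prediction from the motion LoRA
    motion_weight: float = 0.5,         # illustrative blending weight, not from the paper
) -> torch.Tensor:
    """Paste subject-specific predictions into their regions, then blend in the motion prediction."""
    spatial = torch.zeros_like(latents)
    coverage = torch.zeros_like(subject_masks[0])
    for pred, mask in zip(subject_preds, subject_masks):
        spatial = spatial + pred * mask           # each subject controls its own spatial region
        coverage = coverage + mask
    background = (coverage == 0).float()
    spatial = spatial + motion_pred * background  # outside all regions, fall back to the motion prediction
    # Temporal blending: mix the motion-LoRA prediction in everywhere so the
    # interaction dynamics stay consistent across frames.
    return (1 - motion_weight) * spatial + motion_weight * motion_pred


# Toy shapes only, to show the call signature.
B, C, T, H, W = 1, 4, 8, 32, 32
latents = torch.randn(B, C, T, H, W)
preds = [torch.randn_like(latents), torch.randn_like(latents)]
masks = [torch.zeros(B, 1, T, H, W), torch.zeros(B, 1, T, H, W)]
masks[0][..., :, :16] = 1.0   # left half of the frame for subject 1
masks[1][..., :, 16:] = 1.0   # right half of the frame for subject 2
eps = composed_noise_prediction(latents, preds, masks, torch.randn_like(latents))
print(eps.shape)  # torch.Size([1, 4, 8, 32, 32])
```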