Subject-driven Video Generation via Disentangled Identity and Motion

πŸ“… 2025-04-23
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Problem: Zero-shot subject-driven video generation typically relies on large-scale annotated video datasets, incurring high computational cost while still struggling with subject consistency. Method: This paper proposes a tuning-free image-to-video customization framework built on disentangled modeling of identity and motion: subject identity is injected from a few reference images, while temporal dynamics are learned from a small set of unannotated videos. It trains the video diffusion model directly on image-customization data, applies random image token dropping with randomized image initialization to mitigate overfitting and "copy-paste" artifacts, and uses a stochastic switching mechanism during joint optimization of subject-specific and temporal features to alleviate catastrophic forgetting. Contribution/Results: Experiments show that the method significantly improves cross-scene subject consistency and scalability in zero-shot settings, outperforming state-of-the-art video customization approaches.

πŸ“ Abstract
We propose to train a subject-driven customized video generation model by decoupling subject-specific learning from temporal dynamics, in a zero-shot setting without additional tuning. Traditional tuning-free methods for video customization often rely on large annotated video datasets, which are computationally expensive to collect and require extensive annotation. In contrast to previous approaches, we train video customization models directly on an image customization dataset, factorizing video customization into two parts: (1) identity injection through the image customization dataset, and (2) temporal modeling preservation with a small set of unannotated videos via image-to-video training. Additionally, we employ random image token dropping with randomized image initialization during image-to-video fine-tuning to mitigate the copy-and-paste issue. To further enhance learning, we introduce stochastic switching during joint optimization of subject-specific and temporal features, mitigating catastrophic forgetting. Our method achieves strong subject consistency and scalability, outperforming existing video customization models in zero-shot settings, demonstrating the effectiveness of our framework.
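The random image token dropping mentioned above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the choice to drop tokens by removal (rather than masking), and the tensor layout are all assumptions.

```python
import numpy as np

def drop_image_tokens(tokens: np.ndarray, drop_prob: float,
                      rng: np.random.Generator) -> np.ndarray:
    """Randomly drop reference-image conditioning tokens.

    Hypothetical sketch: by removing a random subset of the tokens that
    encode the reference image, the model cannot rely on copying the
    reference verbatim, which discourages copy-and-paste artifacts.

    tokens: (num_tokens, dim) array of image-conditioning tokens.
    """
    keep = rng.random(tokens.shape[0]) >= drop_prob  # per-token coin flip
    return tokens[keep]
```

In a training loop, this would be applied to the reference-image tokens each step before they condition the video diffusion model, so the surviving subset varies across steps.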
Problem

Research questions and friction points this paper addresses.

Decoupling subject-specific learning from temporal dynamics for video generation
Using image datasets to train video models without large annotated videos
Mitigating copy-and-paste issues in subject-driven video generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoupling identity and motion for zero-shot video generation
Using image dataset for video model training directly
Random token dropping and stochastic switching optimization
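The stochastic switching idea listed above can be sketched as a per-step coin flip between the two training objectives. This is a hedged illustration only; the probability parameter, function name, and task labels are assumptions, not details from the paper.

```python
import random

def pick_task(step_rng: random.Random, p_identity: float = 0.5) -> str:
    """Hypothetical switching rule for joint optimization.

    With probability p_identity the step trains the identity-injection
    objective (image-customization data); otherwise it trains the
    motion/temporal objective (image-to-video data). Interleaving the
    two objectives is what mitigates catastrophic forgetting.
    """
    return "identity" if step_rng.random() < p_identity else "motion"
```

A training loop would call `pick_task` once per step and run the corresponding forward/backward pass, so neither objective dominates long enough for the other to be forgotten.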
πŸ”Ž Similar Papers
No similar papers found.