CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities

📅 2024-08-23
🏛️ arXiv.org
📈 Citations: 5
✨ Influential: 0
📄 PDF
🤖 AI Summary
Customized video generation models fine-tuned on static subject images tend to lose the underlying video diffusion model's abilities to generate diverse motion and to compose subject concepts, and prior remedies rely on guiding videos or repeated re-tuning, imposing a high interaction burden on users. To address this, the paper proposes a personalized video generation framework that recovers these abilities without additional videos or re-tuning. The method introduces a plug-and-play, lightweight subject learning module that updates only a few parameters of the Video Diffusion Model (VDM), coupled with a Dynamic Weighted Video Sampling Strategy that suppresses the module in early denoising steps to preserve motion priors and restores it in later steps to recover subject appearance. Extensive experiments demonstrate significant improvements over prior methods in motion naturalness, subject fidelity, and flexibility of compositional concept generation. The source code is publicly available.
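The page does not spell out the subject learning module's internals. A common realization of such a plug-and-play, few-parameter update is a LoRA-style low-rank adapter wrapped around frozen attention projections, with a runtime scale that lets the module be attenuated or detached at inference. The following PyTorch sketch is an illustrative assumption, not the paper's implementation:

```python
# Hypothetical sketch of a "plug-and-play" low-rank subject module wrapped
# around a frozen linear projection (e.g., an attention q/k/v layer).
# The paper's actual module design is not specified on this page.
import torch
import torch.nn as nn

class PluggableLowRankAdapter(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base                    # frozen pretrained projection
        self.base.requires_grad_(False)
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)      # starts as an identity update
        self.scale = 1.0                    # set to 0.0 to "unplug" at runtime

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus a scalable low-rank residual; only `down`/`up`
        # are trained during subject learning.
        return self.base(x) + self.scale * self.up(self.down(x))
```

Because only the small `down`/`up` matrices are trained and the residual is multiplied by `scale`, the module can be weakened or removed per denoising step without touching the pretrained VDM weights.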

📝 Abstract
Customized video generation aims to generate high-quality videos guided by text prompts and a subject's reference images. However, because subject learning is fine-tuned only on static images, it disrupts the abilities of video diffusion models (VDMs) to combine concepts and generate motion. To restore these abilities, some methods use an additional video similar to the prompt to fine-tune or guide the model. This requires frequently changing the guiding video, and even re-tuning the model, when generating different motions, which is very inconvenient for users. In this paper, we propose CustomCrafter, a novel framework that preserves the model's motion generation and concept combination abilities without additional videos or fine-tuning for recovery. To preserve the concept combination ability, we design a plug-and-play module that updates only a few parameters in VDMs, enhancing the model's ability to capture appearance details and to combine concepts for new subjects. For motion generation, we observe that VDMs tend to recover the motion of a video in the early stage of denoising, while focusing on recovering subject details in the later stage. We therefore propose a Dynamic Weighted Video Sampling Strategy: exploiting the pluggability of our subject learning module, we reduce its impact on motion generation in the early stage of denoising, preserving the VDM's ability to generate motion, and restore it in the later stage to repair the appearance details of the specified subject, ensuring the fidelity of the subject's appearance. Experimental results show that our method improves significantly over previous methods. Code is available at https://github.com/WuTao-CS/CustomCrafter
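As a concrete reading of that strategy: during sampling, the subject module's contribution can be scaled by a step-dependent weight that is near zero in the early, high-noise steps and restored later. The sketch below is a minimal illustration assuming diffusers-style `unet` and `scheduler` objects and subject modules exposing the runtime `scale` attribute from the earlier adapter sketch; the paper's actual weights and switch point are not given on this page.

```python
# Minimal sketch of a Dynamic Weighted Video Sampling loop.
# `unet`, `scheduler`, `switch_ratio`, and the 0.0/1.0 weights are all
# illustrative assumptions, not the paper's reported settings.
import torch

def dynamic_weight(step: int, num_steps: int,
                   switch_ratio: float = 0.4,   # hypothetical split point
                   early: float = 0.0, late: float = 1.0) -> float:
    """Suppress the subject module early (motion), restore it late (appearance)."""
    return early if step < int(switch_ratio * num_steps) else late

@torch.no_grad()
def sample(unet, scheduler, subject_modules, latents, text_emb, num_steps=50):
    scheduler.set_timesteps(num_steps)
    for i, t in enumerate(scheduler.timesteps):
        w = dynamic_weight(i, num_steps)
        for m in subject_modules:          # plug-and-play: rescale, no re-tuning
            m.scale = w
        noise_pred = unet(latents, t, encoder_hidden_states=text_emb).sample
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```

Nothing is retrained at inference; the same trained module is simply rescaled per step, which is what makes the "no additional video, no re-tuning" claim possible.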
Problem

Research questions and friction points this paper is trying to address.

Video Generation
Action Diversity
Training Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

CustomCrafter
Plug-and-Play Module
Dynamic Weighted Video Sampling
Tao Wu
College of Computer Science and Technology, Zhejiang University
Yong Zhang
Tencent AI Lab
Xintao Wang
Tencent AI Lab, ARC Lab, Tencent PCG
Xianpan Zhou
Tencent
Computer Vision
Guangcong Zheng
College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China.
Controllable Video/Image Synthesis · Diffusion Model · Personalization Generation · Multi-Modal · BEV
Zhongang Qi
ARC Lab, Tencent PCG
Ying Shan
Distinguished Scientist at Tencent, Director of ARC Lab & AI Lab CVC
Deep learning · computer vision · machine learning · paid search · display ads
Xi Li
College of Computer Science and Technology, Zhejiang University