🤖 AI Summary
This work addresses identity-consistent, motion-natural personalized video generation from multiple reference images. We propose a framework of three modules (a Face Feature Extractor, a Multi-Scale Projector, and an ID Router) that introduces a multi-ID collaborative injection mechanism tailored to video diffusion Transformers. The mechanism aligns facial features across scales and routes each ID dynamically to its spatiotemporal region. The approach combines multi-stage training with a curated text-video dataset to improve identity fidelity, motion coherence, and text-video alignment. Qualitative evaluation shows improvements over existing identity-customized video generation methods in identity preservation, motion naturalness, and controllability. All code, models, and data are publicly released.
📝 Abstract
This paper presents Ingredients, a framework for customizing video creation with video diffusion Transformers by incorporating photos of multiple specific identities (IDs). The method consists of three primary modules: (i) a facial extractor that captures versatile and precise facial features for each human ID from both global and local perspectives; (ii) a multi-scale projector that maps face embeddings into the contextual space of the image query in video diffusion Transformers; (iii) an ID router that dynamically combines and allocates multiple ID embeddings to the corresponding space-time regions. Leveraging a meticulously curated text-video dataset and a multi-stage training protocol, Ingredients demonstrates superior performance in turning custom photos into dynamic and personalized video content. Qualitative evaluations highlight the advantages of the proposed method over existing methods, positioning it as a significant advancement toward more effective generative video control tools in Transformer-based architectures. The data, code, and model weights are publicly available at: https://github.com/feizc/Ingredients
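To make the idea behind module (iii) concrete, the following is a minimal illustrative sketch of per-token ID routing. It is not the released implementation: the function names (`route_ids`, `softmax`, `dot`), the dot-product scoring, the hard argmax assignment, and the additive injection are all assumptions chosen for simplicity. Each space-time token is scored against every ID embedding, assigned to the best-matching ID, and then receives that ID's embedding.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def route_ids(tokens, id_embeddings):
    """Hypothetical router: assign each token to one ID, then inject it.

    tokens:        list of space-time token vectors
    id_embeddings: list of per-ID embedding vectors (same dimension)
    Returns (assignments, routed_tokens).
    """
    assignments, routed = [], []
    for tok in tokens:
        # Score the token against every ID (assumed: dot-product similarity).
        probs = softmax([dot(tok, e) for e in id_embeddings])
        # Hard routing: pick the most probable ID for this token.
        k = max(range(len(probs)), key=probs.__getitem__)
        assignments.append(k)
        # Inject the chosen ID embedding (assumed: simple addition).
        routed.append([t + e for t, e in zip(tok, id_embeddings[k])])
    return assignments, routed


# Toy example: three tokens, two IDs, two-dimensional features.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
id_embeddings = [[1.0, 0.0], [0.0, 1.0]]
assignments, routed = route_ids(tokens, id_embeddings)
print(assignments)
```

In this toy setup the first and third tokens are routed to the first ID and the second token to the second ID. A trained router would presumably learn its scoring function rather than use raw dot products, and might blend IDs softly instead of picking one.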