OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions

📅 2025-06-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing methods struggle to support multi-subject video customization and fine-grained multimodal control. To address this, we propose VideoCus-Factory, a data construction pipeline that automatically extracts temporally aligned control signals—including depth maps, segmentation masks, and camera parameters—from unlabeled raw videos. We introduce two embedding mechanisms, Lottery Embedding and Temporally Aligned Embedding, to enhance multi-subject generalization and to ensure temporal consistency between the control conditions and the generated video. Built upon a diffusion Transformer architecture, our model is trained end-to-end using both image editing data and multimodal conditioning inputs (text, depth, masks, and camera parameters). Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods in both quantitative metrics and qualitative evaluation. To the best of our knowledge, it is the first method to achieve high-fidelity, multi-subject, multi-condition controllable video customization.

📝 Abstract
Existing feedforward subject-driven video customization methods mainly study single-subject scenarios due to the difficulty of constructing multi-subject training data pairs. Another challenging and still underexplored problem is how to use signals such as depth maps, masks, camera poses, and text prompts to control and edit the subject in the customized video. In this paper, we first propose a data construction pipeline, VideoCus-Factory, that produces, from raw videos without labels, training data pairs for multi-subject customization as well as control-signal pairs such as depth-to-video and mask-to-video. Based on our constructed data, we develop an Image-Video Transfer Mixed (IVTM) training scheme with image editing data to enable instructive editing of the subject in the customized video. We then propose a diffusion Transformer framework, OmniVCus, with two embedding mechanisms: Lottery Embedding (LE) and Temporally Aligned Embedding (TAE). LE enables inference with more subjects than seen in training by using the training subjects to activate additional frame embeddings. TAE encourages the generation process to extract guidance from temporally aligned control signals by assigning the same frame embeddings to the control and noise tokens. Experiments demonstrate that our method significantly surpasses state-of-the-art methods in both quantitative and qualitative evaluations. Video demos are at our project page: https://caiyuanhao1998.github.io/project/OmniVCus/. Our code will be released at https://github.com/caiyuanhao1998/Open-OmniVCus.
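The two embedding mechanisms described in the abstract can be sketched in a few lines. This is a minimal illustrative sketch based only on the abstract's wording, not the paper's actual implementation: the function names, tensor shapes, and the per-frame embedding table are assumptions. TAE adds the *same* per-frame embedding to the control tokens and the noise tokens of each frame, so the denoiser can associate each control frame with the frame it guides; LE lets each training subject "draw" a random frame-embedding slot, activating more slots than the subjects seen in any one sample.

```python
import numpy as np

def temporally_aligned_embed(noise_tokens, control_tokens, frame_table):
    """TAE sketch: assign identical frame embeddings to the control and noise
    tokens of each frame t. noise/control_tokens: (T, N, D); frame_table: (T, D)."""
    fe = frame_table[:, None, :]  # (T, 1, D), broadcasts over the N tokens per frame
    return noise_tokens + fe, control_tokens + fe

def lottery_embed(subject_tokens, frame_table, rng):
    """LE sketch: each of the S training subjects draws a distinct random slot
    from the frame-embedding table (T >= S), so training activates more slots
    than one sample's subject count. subject_tokens: (S, N, D)."""
    n_subjects = subject_tokens.shape[0]
    slots = rng.choice(frame_table.shape[0], size=n_subjects, replace=False)
    return subject_tokens + frame_table[slots][:, None, :], slots
```

Because the same embedding is added to both token streams, the offset between a noise token and its control token is unchanged within a frame, which is one simple way the "same frame embeddings" alignment described above could be realized.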
Problem

Research questions and friction points this paper is trying to address.

Overcoming single-subject limitations in video customization
Integrating multimodal signals for video control and editing
Enabling multi-subject training without labeled data
Innovation

Methods, ideas, or system contributions that make the work stand out.

VideoCus-Factory pipeline for multi-subject data
Image-Video Transfer Mixed training method
Diffusion Transformer with Lottery and Aligned Embeddings