🤖 AI Summary
This work addresses key challenges in multi-subject video generation—weak identity consistency, cross-modal semantic misalignment, and poor temporal coherence—by proposing a framework jointly driven by text prompts and multiple reference images. Methodologically, it introduces a hierarchical identity-preserving attention mechanism to explicitly model multi-subject identity features; leverages a pretrained vision-language model (VLM) for fine-grained cross-modal semantic alignment; and combines diffusion training with online reinforcement learning to directly optimize identity fidelity and temporal consistency. Extensive experiments across multiple benchmarks demonstrate that the approach significantly outperforms existing methods in subject identity preservation, semantic accuracy, and motion coherence, substantially improving controllability and visual realism in multi-subject video generation and establishing new state-of-the-art performance.
📝 Abstract
Video generative models pretrained on large-scale datasets can produce high-quality videos, but are often conditioned on text or a single image, limiting controllability and applicability. We introduce ID-Composer, a novel framework that addresses this gap by tackling multi-subject video generation from a text prompt and reference images. This task is challenging as it requires preserving subject identities, integrating semantics across subjects and modalities, and maintaining temporal consistency. To faithfully preserve subject consistency and textual information in synthesized videos, ID-Composer designs a **hierarchical identity-preserving attention mechanism**, which effectively aggregates features within and across subjects and modalities. To better follow user intent, we introduce **semantic understanding via a pretrained vision-language model (VLM)**, leveraging the VLM's superior semantic understanding to provide fine-grained guidance and capture complex interactions between multiple subjects. Since the standard diffusion loss often fails to align critical concepts such as subject identity, we employ an **online reinforcement learning phase** that casts the overall training objective of ID-Composer as reinforcement learning with verifiable rewards (RLVR). Extensive experiments demonstrate that our model surpasses existing methods in identity preservation, temporal consistency, and video quality.
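The abstract does not spell out the hierarchical attention in detail, but the "within and across subjects" aggregation it describes can be sketched as a two-stage attention. The sketch below is a minimal NumPy illustration under assumed token shapes; the function names, shapes, and the mean-free two-stage layout are assumptions for illustration, not the paper's actual architecture.

```python
import numpy as np

def attention(q, k, v):
    # Standard scaled dot-product attention: softmax(q k^T / sqrt(d)) v.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def hierarchical_id_attention(video_tokens, subject_refs):
    """Hypothetical two-stage aggregation.

    Stage 1 (intra-subject): each video token attends over one subject's
    reference-image tokens, yielding a per-subject identity feature.
    Stage 2 (inter-subject): each video token attends across the pooled
    per-subject features, fusing identities from all subjects.
    video_tokens: (T, D) array; subject_refs: list of (N_i, D) arrays.
    """
    # Stage 1: video tokens query each subject's reference tokens separately.
    per_subject = [attention(video_tokens, refs, refs) for refs in subject_refs]
    stacked = np.stack(per_subject, axis=1)            # (T, S, D)

    # Stage 2: fuse across subjects, one video token at a time.
    fused = []
    for t in range(video_tokens.shape[0]):
        q = video_tokens[t:t + 1]                      # (1, D)
        fused.append(attention(q, stacked[t], stacked[t])[0])
    return np.array(fused)                             # (T, D)
```

In practice such a module would sit inside a diffusion transformer block with learned projections and multi-head attention; the sketch only shows the routing pattern (per-subject first, cross-subject second) that the abstract's wording suggests.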