🤖 AI Summary
Existing methods for multi-subject personalized video synthesis suffer from spatiotemporal inconsistency and identity confusion, primarily because they rely on aligning reference images with keywords in the text prompt, which makes subject relationship modeling ambiguous and scales poorly. This paper introduces the first multimodal large language model (MLLM)-based framework for implicit subject relationship modeling, eliminating the need for text alignment or manual annotations and enabling video generation directly from an arbitrary number of independent reference images. The approach integrates diffusion models with MLLM-guided conditioning, cross-subject feature disentanglement, and spatiotemporal consistency constraints. Experiments demonstrate substantial improvements in subject identity preservation and spatiotemporal coherence, outperforming state-of-the-art methods in both qualitative and quantitative evaluations. The framework establishes a new paradigm for personalized narrative and interactive media generation.
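To make the conditioning pipeline concrete, below is a minimal, hypothetical PyTorch sketch of MLLM-guided conditioning. A stand-in MLLM jointly encodes a variable number of reference-image features together with text features into a single token sequence, and a toy video diffusion denoiser attends to those tokens via cross-attention. All module names (`MLLMConditioner`, `VideoDenoiser`), feature dimensions, and shapes are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch of MLLM-guided conditioning for multi-subject video
# generation. Module names and all shapes are illustrative assumptions,
# not the paper's actual implementation.
import torch
import torch.nn as nn

class MLLMConditioner(nn.Module):
    """Stands in for an MLLM that jointly encodes a text prompt and N
    reference images into one sequence of conditioning tokens, so subject
    relationships are modeled implicitly (no image-to-keyword alignment)."""
    def __init__(self, dim=512):
        super().__init__()
        self.image_proj = nn.Linear(768, dim)   # per-image feature -> token
        self.text_proj = nn.Linear(768, dim)    # per-word feature  -> token
        self.fuse = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, image_feats, text_feats):
        # image_feats: (B, N_subjects, 768) -- N_subjects may vary per call
        # text_feats:  (B, N_words, 768)
        tokens = torch.cat(
            [self.image_proj(image_feats), self.text_proj(text_feats)], dim=1
        )
        return self.fuse(tokens)  # (B, N_subjects + N_words, dim)

class VideoDenoiser(nn.Module):
    """Toy stand-in for a video diffusion backbone: noisy latent frames
    attend to the MLLM conditioning tokens via cross-attention."""
    def __init__(self, dim=512):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.out = nn.Linear(dim, dim)

    def forward(self, latents, cond):
        # latents: (B, T*H*W, dim) flattened spatiotemporal video tokens
        attended, _ = self.cross_attn(latents, cond, cond)
        return self.out(attended)  # predicted noise

# Conditioning on two subjects; a third image could be appended freely,
# since the token sequence has no fixed subject count.
cond = MLLMConditioner()(torch.randn(1, 2, 768), torch.randn(1, 10, 768))
noise_pred = VideoDenoiser()(torch.randn(1, 64, 512), cond)
print(noise_pred.shape)  # torch.Size([1, 64, 512])
```

The property the sketch tries to capture is that the conditioning interface accepts any number of subject images without binding each one to a specific prompt keyword.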
📝 Abstract
Video generation has witnessed remarkable progress with the advent of deep generative models, particularly diffusion models. While existing methods excel at generating high-quality videos from text prompts or single images, personalized multi-subject video generation remains a largely unexplored challenge. This task involves synthesizing videos that incorporate multiple distinct subjects, each defined by separate reference images, while ensuring temporal and spatial consistency. Current approaches primarily rely on mapping subject images to keywords in text prompts, which introduces ambiguity and limits their ability to model subject relationships effectively. In this paper, we propose CINEMA, a novel framework for coherent multi-subject video generation that leverages a Multimodal Large Language Model (MLLM). Our approach eliminates the need for explicit correspondences between subject images and text entities, mitigating ambiguity and reducing annotation effort. By using the MLLM to interpret subject relationships, our method facilitates scalability, enabling the use of large and diverse datasets for training. Furthermore, our framework can be conditioned on varying numbers of subjects, offering greater flexibility in personalized content creation. Through extensive evaluations, we demonstrate that our approach significantly improves subject consistency and overall video coherence, paving the way for advanced applications in storytelling, interactive media, and personalized video generation.
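As a purely illustrative contrast (the field names and file paths below are hypothetical, not an API from the paper), the difference between keyword-aligned conditioning and the implicit MLLM-based conditioning described in the abstract can be seen in the shape of the request each one requires:

```python
# Illustrative contrast between the two conditioning interfaces discussed in
# the abstract. All field names and paths are hypothetical, for exposition only.

# Prior keyword-alignment approaches: each reference image must be bound to a
# specific entity word in the prompt, which is ambiguous when entities repeat
# and requires per-sample annotation.
keyword_aligned_request = {
    "prompt": "a <man> walks a <dog> on the beach",
    "bindings": {"<man>": "refs/man.png", "<dog>": "refs/dog.png"},
}

# MLLM-based implicit conditioning (CINEMA-style): an unordered,
# variable-length set of reference images plus a plain prompt; the MLLM
# infers which subject is which and how they relate, so no alignment
# annotations are needed.
implicit_request = {
    "prompt": "a man walks a dog on the beach",
    "references": ["refs/man.png", "refs/dog.png"],  # any number of subjects
}
```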