🤖 AI Summary
Existing methods for multi-subject personalized video synthesis suffer from spatiotemporal inconsistency and identity confusion, primarily because they rely on aligning reference images with keywords in the text prompt, which makes subject relationship modeling ambiguous and scales poorly. This paper introduces the first multimodal large language model (MLLM)-based framework for implicit subject relationship modeling, eliminating the need for text alignment or manual annotations and enabling video generation directly from an arbitrary number of independent reference images. The approach integrates diffusion models with MLLM-guided conditioning, cross-subject feature disentanglement, and spatiotemporal consistency constraints. Experiments demonstrate substantial improvements in subject identity preservation and spatiotemporal coherence, outperforming state-of-the-art methods in both qualitative and quantitative evaluations. The framework establishes a new paradigm for personalized narrative and interactive media generation.
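To make the conditioning pipeline concrete, below is a minimal, hypothetical PyTorch sketch of MLLM-guided conditioning. A stand-in MLLM jointly encodes a variable number of reference-image features together with text features into a single token sequence, and a toy video diffusion denoiser attends to those tokens via cross-attention. All module names (`MLLMConditioner`, `VideoDenoiser`), feature dimensions, and shapes are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch of MLLM-guided conditioning for multi-subject video
# generation. Module names and all shapes are illustrative assumptions,
# not the paper's actual implementation.
import torch
import torch.nn as nn

class MLLMConditioner(nn.Module):
    """Stands in for an MLLM that jointly encodes a text prompt and N
    reference images into one sequence of conditioning tokens, so subject
    relationships are modeled implicitly (no image-to-keyword alignment)."""
    def __init__(self, dim=512):
        super().__init__()
        self.image_proj = nn.Linear(768, dim)   # per-image feature -> token
        self.text_proj = nn.Linear(768, dim)    # per-word feature  -> token
        self.fuse = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, image_feats, text_feats):
        # image_feats: (B, N_subjects, 768) -- N_subjects may vary per call
        # text_feats:  (B, N_words, 768)
        tokens = torch.cat(
            [self.image_proj(image_feats), self.text_proj(text_feats)], dim=1
        )
        return self.fuse(tokens)  # (B, N_subjects + N_words, dim)

class VideoDenoiser(nn.Module):
    """Toy stand-in for a video diffusion backbone: noisy latent frames
    attend to the MLLM conditioning tokens via cross-attention."""
    def __init__(self, dim=512):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.out = nn.Linear(dim, dim)

    def forward(self, latents, cond):
        # latents: (B, T*H*W, dim) flattened spatiotemporal video tokens
        attended, _ = self.cross_attn(latents, cond, cond)
        return self.out(attended)  # predicted noise

# Conditioning on two subjects; a third image could be appended freely,
# since the token sequence has no fixed subject count.
cond = MLLMConditioner()(torch.randn(1, 2, 768), torch.randn(1, 10, 768))
noise_pred = VideoDenoiser()(torch.randn(1, 64, 512), cond)
print(noise_pred.shape)  # torch.Size([1, 64, 512])
```

The property the sketch tries to capture is that the conditioning interface accepts any number of subject images without binding each one to a specific prompt keyword.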
📝 Abstract
Video generation has witnessed remarkable progress with the advent of deep generative models, particularly diffusion models. While existing methods excel at generating high-quality videos from text prompts or single images, personalized multi-subject video generation remains a largely unexplored challenge. This task involves synthesizing videos that incorporate multiple distinct subjects, each defined by separate reference images, while ensuring temporal and spatial consistency. Current approaches primarily rely on mapping subject images to keywords in text prompts, which introduces ambiguity and limits their ability to model subject relationships effectively. In this paper, we propose CINEMA, a novel framework for coherent multi-subject video generation that leverages a Multimodal Large Language Model (MLLM). Our approach eliminates the need for explicit correspondences between subject images and text entities, mitigating ambiguity and reducing annotation effort. By using the MLLM to interpret subject relationships, our method facilitates scalability, enabling the use of large and diverse datasets for training. Furthermore, our framework can be conditioned on varying numbers of subjects, offering greater flexibility in personalized content creation. Through extensive evaluations, we demonstrate that our approach significantly improves subject consistency and overall video coherence, paving the way for advanced applications in storytelling, interactive media, and personalized video generation.
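As a purely illustrative contrast (the field names and file paths below are hypothetical, not an API from the paper), the difference between keyword-aligned conditioning and the implicit MLLM-based conditioning described in the abstract can be seen in the shape of the request each one requires:

```python
# Illustrative contrast between the two conditioning interfaces discussed in
# the abstract. All field names and paths are hypothetical, for exposition only.

# Prior keyword-alignment approaches: each reference image must be bound to a
# specific entity word in the prompt, which is ambiguous when entities repeat
# and requires per-sample annotation.
keyword_aligned_request = {
    "prompt": "a <man> walks a <dog> on the beach",
    "bindings": {"<man>": "refs/man.png", "<dog>": "refs/dog.png"},
}

# MLLM-based implicit conditioning (CINEMA-style): an unordered,
# variable-length set of reference images plus a plain prompt; the MLLM
# infers which subject is which and how they relate, so no alignment
# annotations are needed.
implicit_request = {
    "prompt": "a man walks a dog on the beach",
    "references": ["refs/man.png", "refs/dog.png"],  # any number of subjects
}
```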