Phantom: Subject-consistent video generation via cross-modal alignment

📅 2025-02-16

📈 Citations: 0

✨ Influential: 0

career value

186K/year

🤖 AI Summary

This work addresses the problem of subject-consistent video generation jointly driven by text and image inputs. We propose the first text–image–video tri-modal co-injection architecture. Methodologically, we reformulate the dual-modality prompt injection mechanism by integrating contrastive learning and feature disentanglement to achieve deep cross-modal alignment between textual semantics and visual subject attributes—including identity (ID), appearance, and motion. Our framework unifies modeling for both single- and multi-subject scenarios, significantly improving inter-frame consistency and semantic controllability while preserving subject ID. Evaluated on the Subject-to-Video benchmark, our approach achieves state-of-the-art performance, boosting subject ID retention by 37%. It enables high-fidelity subject reuse, multi-character collaborative generation, and fine-grained textual editing—demonstrating superior control and fidelity in subject-driven video synthesis.

Technology Category

Application Category

📝 Abstract

The continuous development of foundational models for video generation is evolving into various applications, with subject-consistent video generation still in the exploratory stage. We refer to this as Subject-to-Video, which extracts subject elements from reference images and generates subject-consistent video through textual instructions. We believe that the essence of subject-to-video lies in balancing the dual-modal prompts of text and image, thereby deeply and simultaneously aligning both text and visual content. To this end, we propose Phantom, a unified video generation framework for both single and multi-subject references. Building on existing text-to-video and image-to-video architectures, we redesign the joint text-image injection model and drive it to learn cross-modal alignment via text-image-video triplet data. In particular, we emphasize subject consistency in human generation, covering existing ID-preserving video generation while offering enhanced advantages. The project homepage is here https://phantom-video.github.io/Phantom/.

Problem

Research questions and friction points this paper is trying to address.

Subject-consistent video generation from images

Cross-modal alignment of text and visual content

Enhanced ID-preserving human video generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modal alignment

Text-image-video triplet

Subject-consistent video generation

🔎 Similar Papers

A Simple but Strong Baseline for Sounding Video Generation: Effective Adaptation of Audio and Video Diffusion Models for Joint Generation

2024-09-26arXiv.orgCitations: 4