PolyVivid: Vivid Multi-Subject Video Generation with Cross-Modal Interaction and Enhancement

📅 2025-06-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address weak identity consistency, unnatural inter-subject interactions, and poor fine-grained text-image alignment in multi-subject video generation, this work proposes the first high-fidelity multi-subject customization framework. Methodologically, it introduces: (i) a VLLM-driven text-image fusion module for semantic grounding; (ii) a structured bidirectional enhancement mechanism leveraging 3D-RoPE for spatiotemporal coherence; (iii) an attention-inheritance-based identity injection module to preserve subject-specific appearance across frames; and (iv) an MLLM-powered clique-consolidated multi-subject data pipeline for scalable, diverse training. Experiments demonstrate state-of-the-art performance across identity fidelity, video photorealism, and subject-text alignment—surpassing both leading open-source and commercial models. The framework significantly mitigates identity drift, enhances subject distinguishability, and improves the naturalness of dynamic inter-subject interactions, establishing a new paradigm for controllable multi-subject video generation.
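
As a concrete illustration of the 3D-RoPE component mentioned above, the sketch below shows one common way a 3D rotary position embedding can be realized: the channel dimension is split across the time, height, and width axes, and a standard 1D rotary rotation is applied per axis. The split ratio, base frequency, and function names are illustrative assumptions, not PolyVivid's released implementation.

```python
# Minimal 3D-RoPE sketch: split the head dimension into three groups and
# apply a standard rotary embedding along the time, height, and width axes.
# Split ratios and the base frequency are illustrative, not the paper's.
import torch

def rope_1d(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply a 1D rotary embedding to x of shape (..., n, d) with positions (n,)."""
    d = x.shape[-1]
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32, device=x.device) / d)
    angles = pos[:, None].float() * freqs[None, :]           # (n, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                      # rotate channel pairs
    return torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1).flatten(-2)

def rope_3d(x: torch.Tensor, t: torch.Tensor, h: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """x: (batch, n_tokens, d); t/h/w: per-token integer coordinates of shape (n_tokens,).
    Assumes d is divisible by 8 so every axis group has an even width."""
    d = x.shape[-1]
    dt, dh = d // 2, d // 4                                  # illustrative 2:1:1 split
    xt, xh, xw = x[..., :dt], x[..., dt:dt + dh], x[..., dt + dh:]
    return torch.cat((rope_1d(xt, t), rope_1d(xh, h), rope_1d(xw, w)), dim=-1)
```

Because each token carries explicit (time, height, width) coordinates, text and image embeddings placed into this shared positional space can be fused bidirectionally while keeping their spatiotemporal structure, which is the role the summary attributes to the enhancement module.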

📝 Abstract
Despite recent advances in video generation, existing models still lack fine-grained controllability, especially for multi-subject customization with consistent identity and interaction. In this paper, we propose PolyVivid, a multi-subject video customization framework that enables flexible and identity-consistent generation. To establish accurate correspondences between subject images and textual entities, we design a VLLM-based text-image fusion module that embeds visual identities into the textual space for precise grounding. To further enhance identity preservation and subject interaction, we propose a 3D-RoPE-based enhancement module that enables structured bidirectional fusion between text and image embeddings. Moreover, we develop an attention-inherited identity injection module to effectively inject fused identity features into the video generation process, mitigating identity drift. Finally, we construct an MLLM-based data pipeline that combines MLLM-based grounding, segmentation, and a clique-based subject consolidation strategy to produce high-quality multi-subject data, effectively enhancing subject distinction and reducing ambiguity in downstream video generation. Extensive experiments demonstrate that PolyVivid achieves superior performance in identity fidelity, video realism, and subject alignment, outperforming existing open-source and commercial baselines.
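
The attention-inherited identity injection described in the abstract is reminiscent of reference-conditioning schemes that append identity tokens to the keys and values of an existing attention layer, so the pretrained attention weights are reused ("inherited") rather than retrained from scratch. A minimal sketch under that assumption, with hypothetical shapes and names:

```python
# Hedged sketch of identity injection via attention: fused identity tokens are
# appended to the keys/values of a pretrained attention layer so every video
# token can attend to the subject features. Whether PolyVivid shares weights
# exactly this way is an assumption based on the paper's description.
import torch
import torch.nn.functional as F

def identity_injected_attention(
    video_tokens: torch.Tensor,     # (B, N_video, D) latent video tokens
    identity_tokens: torch.Tensor,  # (B, N_id, D) fused text-image identity features
    wq: torch.nn.Linear, wk: torch.nn.Linear, wv: torch.nn.Linear,
    num_heads: int = 8,
) -> torch.Tensor:
    B, N, D = video_tokens.shape
    kv_in = torch.cat([video_tokens, identity_tokens], dim=1)   # extend the context
    q = wq(video_tokens).view(B, N, num_heads, -1).transpose(1, 2)
    k = wk(kv_in).view(B, kv_in.shape[1], num_heads, -1).transpose(1, 2)
    v = wv(kv_in).view(B, kv_in.shape[1], num_heads, -1).transpose(1, 2)
    out = F.scaled_dot_product_attention(q, k, v)               # (B, H, N, D/H)
    return out.transpose(1, 2).reshape(B, N, D)
```

Keeping the query path restricted to video tokens means the injected identity features only condition the output, which is one plausible way to mitigate identity drift without disturbing the base model's generative dynamics.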
Problem

Research questions and friction points this paper is trying to address.

Lack of fine-grained controllability in multi-subject video generation
Difficulty in maintaining consistent identity and interaction in generated videos
Difficulty establishing accurate correspondence between subject images and textual entities
Innovation

Methods, ideas, or system contributions that make the work stand out.

VLLM-based text-image fusion module
3D-RoPE-based enhancement module
Attention-inherited identity injection module
MLLM-based data pipeline with grounding, segmentation, and clique-based subject consolidation (a toy sketch follows below)
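
A toy sketch of how the clique-based subject consolidation in the data pipeline might work: treat each detected subject crop as a graph node, connect mutually similar crops, and merge each maximal clique into one subject identity, reducing duplicate or ambiguous subjects in the training data. The similarity threshold, embedding source, and greedy clique assignment are illustrative assumptions, not the paper's exact procedure.

```python
# Toy sketch of clique-based subject consolidation over segmented crops.
# Thresholds and the embedding model are placeholders, not PolyVivid's choices.
import itertools
import networkx as nx
import numpy as np

def consolidate_subjects(embeddings: np.ndarray, threshold: float = 0.85):
    """embeddings: (n_subjects, d) L2-normalized features of segmented subject crops."""
    n = embeddings.shape[0]
    g = nx.Graph()
    g.add_nodes_from(range(n))
    for i, j in itertools.combinations(range(n), 2):
        if float(embeddings[i] @ embeddings[j]) >= threshold:  # cosine similarity
            g.add_edge(i, j)
    # Each maximal clique is one consolidated subject; assign nodes to the
    # largest clique first so every detection maps to a single subject.
    groups, assigned = [], set()
    for clique in sorted(nx.find_cliques(g), key=len, reverse=True):
        members = [i for i in clique if i not in assigned]
        if members:
            groups.append(members)
            assigned.update(members)
    return groups  # list of index groups, one per distinct subject
```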
👥 Authors
Teng Hu, Shanghai Jiao Tong University
Zhentao Yu, Researcher, Tencent Hunyuan (Computer Vision)
Zhengguang Zhou, Tencent Hunyuan
Jiangning Zhang, Zhejiang University
Yuan Zhou, Tencent Hunyuan
Qinglin Lu, Tencent Hunyuan
Ran Yi, Associate Professor, Shanghai Jiao Tong University (Computer Vision, Computer Graphics)