Identity-Preserving Text-to-Video Generation via Training-Free Prompt, Image, and Guidance Enhancement

📅 2025-09-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Addressing the challenges of identity preservation, reliance on fine-tuning, and data scarcity in text-to-video generation, this paper proposes a training-free triple-enhancement framework. First, GPT-4o–driven face-aware prompt enhancement bridges the semantic gap between textual descriptions and visual content. Second, a prompt-aware reference image optimization mechanism improves input consistency. Third, a unified gradient-guided strategy jointly optimizes identity fidelity and spatiotemporal coherence during diffusion model sampling—enabling inference-time refinement without architectural modification. The method requires no model training or fine-tuning. Extensive evaluation on a thousand-video benchmark demonstrates significant improvements in character identity consistency and video quality, outperforming state-of-the-art approaches in both automated metrics and human assessment. It won first place in the ACM Multimedia 2025 Identity-Preserving Video Generation Challenge, validating its strong generalizability and practical applicability.
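The first stage is easy to prototype against a vision-language model API. Below is a minimal sketch of the face-aware prompt enhancement step, assuming the standard OpenAI Python client; the instruction wording and the `enhance_prompt` helper are illustrative guesses, not the authors' actual prompt or code.

```python
# Sketch: enrich a video prompt with facial details from a reference image.
# Hypothetical helper; instruction text is our own guess, not the paper's.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def enhance_prompt(prompt: str, ref_image_path: str) -> str:
    """Ask GPT-4o to fold facial details from the reference image
    into the original video prompt."""
    with open(ref_image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("Describe the face in this image (age, hair, eyes, "
                          "distinctive features), then rewrite the following "
                          f"video prompt to include those details:\n{prompt}")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```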

📝 Abstract
Identity-preserving text-to-video (IPT2V) generation creates videos faithful to both a reference subject image and a text prompt. While fine-tuning large pretrained video diffusion models on ID-matched data achieves state-of-the-art results on IPT2V, data scarcity and high tuning costs hinder broader improvement. We thus introduce a Training-Free Prompt, Image, and Guidance Enhancement (TPIGE) framework that bridges the semantic gap between the video description and the reference image, and we design sampling guidance that enhances identity preservation and video quality, achieving performance gains at minimal cost. Specifically, we first propose Face Aware Prompt Enhancement, using GPT-4o to enhance the text prompt with facial details derived from the reference image. We then propose Prompt Aware Reference Image Enhancement, leveraging an identity-preserving image generator to refine the reference image, rectifying conflicts with the text prompt. This mutual refinement significantly improves input quality before video generation. Finally, we propose ID-Aware Spatiotemporal Guidance Enhancement, utilizing unified gradients to optimize identity preservation and video quality jointly during generation. Our method outperforms prior work, as validated by automatic and human evaluations on a 1000-video test set, and won first place in the ACM Multimedia 2025 Identity-Preserving Video Generation Challenge, demonstrating state-of-the-art performance and strong generality. The code is available at https://github.com/Andyplus1/IPT2V.git.
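To make the guidance step concrete, here is a hedged sketch of one gradient-guided denoising update in the spirit of ID-Aware Spatiotemporal Guidance Enhancement. It assumes an epsilon-predicting diffusion model, and `decode_frames`, `face_embed`, and the guidance scale are placeholder components; the paper's exact losses and schedule may differ.

```python
import torch
import torch.nn.functional as F

def id_guided_update(latents, noise_pred, alpha_bar_t,
                     ref_embed, decode_frames, face_embed, scale=0.1):
    """One inference-time guidance update: steer the latents with the
    gradient of an identity loss computed on the one-step clean estimate.
    No model weights are touched -- this is pure sampling-time guidance."""
    lat = latents.detach().requires_grad_(True)
    # One-step x0 estimate from the epsilon prediction (standard DDPM/DDIM);
    # alpha_bar_t is the cumulative noise schedule term as a tensor scalar.
    x0 = (lat - (1.0 - alpha_bar_t).sqrt() * noise_pred) / alpha_bar_t.sqrt()
    frames = decode_frames(x0)                      # latents -> RGB frames
    emb = face_embed(frames)                        # (T, D) face embeddings
    id_loss = 1.0 - F.cosine_similarity(
        emb, ref_embed.expand_as(emb), dim=-1).mean()
    grad = torch.autograd.grad(id_loss, lat)[0]
    return latents - scale * grad                   # steered latents
```

The same pattern extends to any differentiable quality objective: compute the loss on the decoded one-step estimate, backpropagate to the latents, and subtract a scaled gradient before the scheduler's next step.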
Problem

Research questions and friction points this paper is trying to address.

Bridging semantic gap between video description and reference image
Enhancing identity preservation without costly fine-tuning
Improving video quality while maintaining subject fidelity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free framework enhances text and image inputs
GPT-4o augments prompts with facial details from reference
Unified gradients jointly optimize identity preservation and spatiotemporal coherence during generation (see the sketch after this list)
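A minimal sketch of what such a unified objective could look like, assuming per-frame face embeddings and a simple frame-difference smoothness term; the weights and the coherence term are our own illustrative choices, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def unified_guidance_loss(frames, ref_embed, face_embed, w_id=1.0, w_tc=0.5):
    """Combine identity fidelity and temporal coherence into one scalar
    loss whose gradient can drive sampling-time guidance."""
    emb = face_embed(frames)                        # (T, D) per-frame embeddings
    id_term = 1.0 - F.cosine_similarity(
        emb, ref_embed.expand_as(emb), dim=-1).mean()
    # Penalize abrupt frame-to-frame changes as a crude coherence proxy.
    tc_term = (frames[1:] - frames[:-1]).pow(2).mean()
    return w_id * id_term + w_tc * tc_term
```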
Jiayi Gao
Wangxuan Institute of Computer Technology, Peking University, Beijing, China
Changcheng Hua
Wangxuan Institute of Computer Technology, Peking University, Beijing, China
Qingchao Chen
Assistant Professor, Peking University
Transfer Learning, Medical Data Analysis, Multi-modal Human Sensing, Radar Systems
Yuxin Peng
Wangxuan Institute of Computer Technology, Peking University, Beijing, China
Yang Liu
Wangxuan Institute of Computer Technology, Peking University, Beijing, China