Identity-Preserving Text-to-Video Generation via Training-Free Prompt, Image, and Guidance Enhancement

📅 2025-09-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Addressing the challenges of identity preservation, reliance on fine-tuning, and data scarcity in text-to-video generation, this paper proposes a training-free triple-enhancement framework. First, GPT-4o–driven face-aware prompt enhancement bridges the semantic gap between textual descriptions and visual content. Second, a prompt-aware reference image optimization mechanism improves input consistency. Third, a unified gradient-guided strategy jointly optimizes identity fidelity and spatiotemporal coherence during diffusion model sampling—enabling inference-time refinement without architectural modification. The method requires no model training or fine-tuning. Extensive evaluation on a thousand-video benchmark demonstrates significant improvements in character identity consistency and video quality, outperforming state-of-the-art approaches in both automated metrics and human assessment. It won first place in the ACM Multimedia 2025 Identity-Preserving Video Generation Challenge, validating its strong generalizability and practical applicability.
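The first stage is easy to prototype against a vision-language model API. Below is a minimal sketch of the face-aware prompt enhancement step, assuming the standard OpenAI Python client; the instruction wording and the `enhance_prompt` helper are illustrative guesses, not the authors' actual prompt or code.

```python
# Sketch: enrich a video prompt with facial details from a reference image.
# Hypothetical helper; instruction text is our own guess, not the paper's.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def enhance_prompt(prompt: str, ref_image_path: str) -> str:
    """Ask GPT-4o to fold facial details from the reference image
    into the original video prompt."""
    with open(ref_image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("Describe the face in this image (age, hair, eyes, "
                          "distinctive features), then rewrite the following "
                          f"video prompt to include those details:\n{prompt}")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```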

📝 Abstract
Identity-preserving text-to-video (IPT2V) generation creates videos faithful to both a reference subject image and a text prompt. While fine-tuning large pretrained video diffusion models on ID-matched data achieves state-of-the-art results on IPT2V, data scarcity and high tuning costs hinder broader improvement. We thus introduce a Training-Free Prompt, Image, and Guidance Enhancement (TPIGE) framework that bridges the semantic gap between the video description and the reference image, and we design sampling guidance that enhances identity preservation and video quality, achieving performance gains at minimal cost. Specifically, we first propose Face Aware Prompt Enhancement, using GPT-4o to enhance the text prompt with facial details derived from the reference image. We then propose Prompt Aware Reference Image Enhancement, leveraging an identity-preserving image generator to refine the reference image, rectifying conflicts with the text prompt. This mutual refinement significantly improves input quality before video generation. Finally, we propose ID-Aware Spatiotemporal Guidance Enhancement, utilizing unified gradients to optimize identity preservation and video quality jointly during generation. Our method outperforms prior work, as validated by automatic and human evaluations on a 1000-video test set, and won first place in the ACM Multimedia 2025 Identity-Preserving Video Generation Challenge, demonstrating state-of-the-art performance and strong generality. The code is available at https://github.com/Andyplus1/IPT2V.git.
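To make the guidance step concrete, here is a hedged sketch of one gradient-guided denoising update in the spirit of ID-Aware Spatiotemporal Guidance Enhancement. It assumes an epsilon-predicting diffusion model, and `decode_frames`, `face_embed`, and the guidance scale are placeholder components; the paper's exact losses and schedule may differ.

```python
import torch
import torch.nn.functional as F

def id_guided_update(latents, noise_pred, alpha_bar_t,
                     ref_embed, decode_frames, face_embed, scale=0.1):
    """One inference-time guidance update: steer the latents with the
    gradient of an identity loss computed on the one-step clean estimate.
    No model weights are touched -- this is pure sampling-time guidance."""
    lat = latents.detach().requires_grad_(True)
    # One-step x0 estimate from the epsilon prediction (standard DDPM/DDIM);
    # alpha_bar_t is the cumulative noise schedule term as a tensor scalar.
    x0 = (lat - (1.0 - alpha_bar_t).sqrt() * noise_pred) / alpha_bar_t.sqrt()
    frames = decode_frames(x0)                      # latents -> RGB frames
    emb = face_embed(frames)                        # (T, D) face embeddings
    id_loss = 1.0 - F.cosine_similarity(
        emb, ref_embed.expand_as(emb), dim=-1).mean()
    grad = torch.autograd.grad(id_loss, lat)[0]
    return latents - scale * grad                   # steered latents
```

The same pattern extends to any differentiable quality objective: compute the loss on the decoded one-step estimate, backpropagate to the latents, and subtract a scaled gradient before the scheduler's next step.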
Problem

Research questions and friction points this paper is trying to address.

Bridging semantic gap between video description and reference image
Enhancing identity preservation without costly fine-tuning
Improving video quality while maintaining subject fidelity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free framework enhances text and image inputs
GPT-4o augments prompts with facial details from reference
Unified gradients jointly optimize identity preservation and spatiotemporal coherence during generation (see the sketch after this list)
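A minimal sketch of what such a unified objective could look like, assuming per-frame face embeddings and a simple frame-difference smoothness term; the weights and the coherence term are our own illustrative choices, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def unified_guidance_loss(frames, ref_embed, face_embed, w_id=1.0, w_tc=0.5):
    """Combine identity fidelity and temporal coherence into one scalar
    loss whose gradient can drive sampling-time guidance."""
    emb = face_embed(frames)                        # (T, D) per-frame embeddings
    id_term = 1.0 - F.cosine_similarity(
        emb, ref_embed.expand_as(emb), dim=-1).mean()
    # Penalize abrupt frame-to-frame changes as a crude coherence proxy.
    tc_term = (frames[1:] - frames[:-1]).pow(2).mean()
    return w_id * id_term + w_tc * tc_term
```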
Jiayi Gao
Wangxuan Institute of Computer Technology, Peking University, Beijing, China
Changcheng Hua
Wangxuan Institute of Computer Technology, Peking University, Beijing, China
Qingchao Chen
Assistant Professor, Peking University
Transfer Learning, Medical Data Analysis, Multi-modal Human Sensing, Radar Systems
Yuxin Peng
Wangxuan Institute of Computer Technology, Peking University, Beijing, China
Yang Liu
Wangxuan Institute of Computer Technology, Peking University, Beijing, China