CustomVideoX: 3D Reference Attention Driven Dynamic Adaptation for Zero-Shot Customized Video Diffusion Transformers

📅 2025-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Personalized video generation faces two major challenges: temporal inconsistency and quality degradation. This paper introduces CustomVideoX, presented as the first zero-shot reference-image-driven framework for personalized video generation, capable of producing high-fidelity, temporally consistent customized videos from a single reference image and a text prompt. Methodologically, it integrates a 3D Reference Attention mechanism for joint spatio-temporal modeling between the reference image and all video frames; a Time-Aware Reference Attention Bias (TAB) strategy that modulates the reference's influence across denoising steps; and an Entity Region-Aware Enhancement (ERAE) module that aligns highly activated regions of key entity tokens with the injected reference features. The authors also construct VideoBench, a dedicated benchmark for evaluating personalized video generation. Built on a video diffusion Transformer with efficient LoRA fine-tuning, the method achieves state-of-the-art performance on VideoBench, improving inter-frame consistency by 32% and notably enhancing detail fidelity and text–vision alignment.

📝 Abstract
Customized generation has achieved significant progress in image synthesis, yet personalized video generation remains challenging due to temporal inconsistencies and quality degradation. In this paper, we introduce CustomVideoX, an innovative framework leveraging the video diffusion transformer for personalized video generation from a reference image. CustomVideoX capitalizes on pre-trained video networks by exclusively training the LoRA parameters to extract reference features, ensuring both efficiency and adaptability. To facilitate seamless interaction between the reference image and video content, we propose 3D Reference Attention, which enables direct and simultaneous engagement of reference image features with all video frames across spatial and temporal dimensions. To mitigate the excessive influence of reference image features and textual guidance on generated video content during inference, we implement the Time-Aware Reference Attention Bias (TAB) strategy, dynamically modulating reference bias over different time steps. Additionally, we introduce the Entity Region-Aware Enhancement (ERAE) module, aligning highly activated regions of key entity tokens with reference feature injection by adjusting attention bias. To thoroughly evaluate personalized video generation, we establish a new benchmark, VideoBench, comprising over 50 objects and 100 prompts for extensive assessment. Experimental results show that CustomVideoX significantly outperforms existing methods in terms of video consistency and quality.
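To make the mechanism in the abstract concrete, below is a minimal NumPy sketch of the core idea behind 3D Reference Attention with a time-aware reference bias: reference-image tokens are concatenated with the jointly flattened spatio-temporal video tokens so every frame attends to the reference directly, and an additive bias on the reference columns is scaled down as denoising progresses. All names, shapes, and the linear bias schedule are illustrative assumptions; the paper operates inside a video diffusion Transformer's attention layers with LoRA-trained parameters, not standalone NumPy.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def reference_attention_3d(video_tokens, ref_tokens, t, T):
    """Conceptual sketch of 3D reference attention with a time-aware bias.

    video_tokens: (F*H*W, d) tokens from ALL frames, flattened jointly so the
                  reference interacts with spatial and temporal positions at once
    ref_tokens:   (R, d) tokens extracted from the reference image
    t, T:         current and total denoising steps; the reference bias is
                  strongest early and relaxes as t approaches T (assumed schedule)
    """
    n_video, d = video_tokens.shape
    # Concatenate reference tokens into the key/value set: every video token
    # can attend to the reference image alongside all other frames' tokens.
    kv = np.concatenate([video_tokens, ref_tokens], axis=0)
    scores = video_tokens @ kv.T / np.sqrt(d)
    # Time-aware additive bias on the reference columns only (illustrative
    # linear decay standing in for the paper's TAB schedule).
    bias = np.zeros_like(scores)
    bias[:, n_video:] = 1.0 - t / T
    attn = softmax(scores + bias, axis=-1)
    return attn @ kv
```

The key design point the sketch captures is that the reference is injected through attention rather than per-frame conditioning, so consistency with the reference and consistency across frames are handled by the same joint operation.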
Problem

Research questions and friction points this paper is trying to address.

personalized video generation
temporal inconsistencies
reference image features
Innovation

Methods, ideas, or system contributions that make the work stand out.

LoRA parameters for feature extraction
3D Reference Attention for video frames
Time-Aware Reference Attention Bias strategy
👥 Authors
D. She (University of Science and Technology of China)
Mushui Liu (Zhejiang University) — Generative Models, Multi-modal Learning, Few-shot Learning
Jingxuan Pang (Zhejiang University)
Jin Wang (University of Science and Technology of China)
Zhen Yang (Hong Kong University of Science and Technology (Guangzhou))
Wanggui He (Researcher, Alibaba Group)
Guanghao Zhang
Yi Wang (Zhejiang University)
Qihan Huang (PhD Student, Zhejiang University)
Haobin Tang (University of Science and Technology of China)
Yunlong Yu (Zhejiang University)
Siming Fu (Zhejiang University) — LLM, Long-tailed Learning, Multi-modal