CustomVideoX: 3D Reference Attention Driven Dynamic Adaptation for Zero-Shot Customized Video Diffusion Transformers

📅 2025-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Personalized video generation faces two major challenges: temporal inconsistency and quality degradation. This paper introduces CustomVideoX, presented as the first zero-shot reference-image-driven framework for personalized video generation, capable of producing high-fidelity, temporally consistent customized videos from a single reference image and a text prompt. Methodologically, it integrates a 3D Reference Attention mechanism for joint spatio-temporal modeling between the reference image and all video frames; a Time-Aware Reference Attention Bias (TAB) strategy that modulates the reference's influence across denoising steps; and an Entity Region-Aware Enhancement (ERAE) module that aligns highly activated regions of key entity tokens with the injected reference features. The authors also construct VideoBench, a dedicated benchmark for evaluating personalized video generation. Built on a video diffusion Transformer with efficient LoRA fine-tuning, the method achieves state-of-the-art performance on VideoBench, improving inter-frame consistency by 32% and notably enhancing detail fidelity and text–vision alignment.

📝 Abstract
Customized generation has achieved significant progress in image synthesis, yet personalized video generation remains challenging due to temporal inconsistencies and quality degradation. In this paper, we introduce CustomVideoX, an innovative framework leveraging the video diffusion transformer for personalized video generation from a reference image. CustomVideoX capitalizes on pre-trained video networks by exclusively training the LoRA parameters to extract reference features, ensuring both efficiency and adaptability. To facilitate seamless interaction between the reference image and video content, we propose 3D Reference Attention, which enables direct and simultaneous engagement of reference image features with all video frames across spatial and temporal dimensions. To mitigate the excessive influence of reference image features and textual guidance on generated video content during inference, we implement the Time-Aware Reference Attention Bias (TAB) strategy, dynamically modulating reference bias over different time steps. Additionally, we introduce the Entity Region-Aware Enhancement (ERAE) module, aligning highly activated regions of key entity tokens with reference feature injection by adjusting attention bias. To thoroughly evaluate personalized video generation, we establish a new benchmark, VideoBench, comprising over 50 objects and 100 prompts for extensive assessment. Experimental results show that CustomVideoX significantly outperforms existing methods in terms of video consistency and quality.
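To make the mechanism in the abstract concrete, below is a minimal NumPy sketch of the core idea behind 3D Reference Attention with a time-aware reference bias: reference-image tokens are concatenated with the jointly flattened spatio-temporal video tokens so every frame attends to the reference directly, and an additive bias on the reference columns is scaled down as denoising progresses. All names, shapes, and the linear bias schedule are illustrative assumptions; the paper operates inside a video diffusion Transformer's attention layers with LoRA-trained parameters, not standalone NumPy.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def reference_attention_3d(video_tokens, ref_tokens, t, T):
    """Conceptual sketch of 3D reference attention with a time-aware bias.

    video_tokens: (F*H*W, d) tokens from ALL frames, flattened jointly so the
                  reference interacts with spatial and temporal positions at once
    ref_tokens:   (R, d) tokens extracted from the reference image
    t, T:         current and total denoising steps; the reference bias is
                  strongest early and relaxes as t approaches T (assumed schedule)
    """
    n_video, d = video_tokens.shape
    # Concatenate reference tokens into the key/value set: every video token
    # can attend to the reference image alongside all other frames' tokens.
    kv = np.concatenate([video_tokens, ref_tokens], axis=0)
    scores = video_tokens @ kv.T / np.sqrt(d)
    # Time-aware additive bias on the reference columns only (illustrative
    # linear decay standing in for the paper's TAB schedule).
    bias = np.zeros_like(scores)
    bias[:, n_video:] = 1.0 - t / T
    attn = softmax(scores + bias, axis=-1)
    return attn @ kv
```

The key design point the sketch captures is that the reference is injected through attention rather than per-frame conditioning, so consistency with the reference and consistency across frames are handled by the same joint operation.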
Problem

Research questions and friction points this paper is trying to address.

personalized video generation
temporal inconsistencies
reference image features
Innovation

Methods, ideas, or system contributions that make the work stand out.

LoRA parameters for feature extraction
3D Reference Attention for video frames
Time-Aware Reference Attention Bias strategy
👥 Authors
D. She (University of Science and Technology of China)
Mushui Liu (Zhejiang University) — Generative Models, Multi-modal Learning, Few-shot Learning
Jingxuan Pang (Zhejiang University)
Jin Wang (University of Science and Technology of China)
Zhen Yang (Hong Kong University of Science and Technology (Guangzhou))
Wanggui He (Researcher, Alibaba Group)
Guanghao Zhang
Yi Wang (Zhejiang University)
Qihan Huang (PhD Student, Zhejiang University)
Haobin Tang (University of Science and Technology of China)
Yunlong Yu (Zhejiang University)
Siming Fu (Zhejiang University) — LLM, Long-tailed Learning, Multi-modal