Cross-Frame Representation Alignment for Fine-Tuning Video Diffusion Models

📅 2025-06-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
User-level fine-tuning of video diffusion models (VDMs) often degrades temporal coherence because semantic consistency across frames is hard to preserve, a problem that is especially pronounced when fine-tuning on fine-grained attributes. To address this, the paper proposes Cross-frame Representation Alignment (CREPA), a regularization that aligns the hidden states of each frame with external pretrained visual features drawn from neighboring frames. CREPA extends the Representation Alignment (REPA) framework from image diffusion to video by enforcing cross-frame feature alignment, and it is compatible with parameter-efficient adaptation methods such as LoRA. Experiments on CogVideoX-5B and Hunyuan Video show that CREPA improves both visual fidelity and inter-frame semantic consistency under fine-grained fine-tuning, and that it generalizes across datasets with diverse attributes without architectural modifications or additional inference overhead.

📝 Abstract
Fine-tuning Video Diffusion Models (VDMs) at the user level to generate videos that reflect specific attributes of training data presents notable challenges, yet remains underexplored despite its practical importance. Meanwhile, recent work such as Representation Alignment (REPA) has shown promise in improving the convergence and quality of DiT-based image diffusion models by aligning, or assimilating, their internal hidden states with external pretrained visual features, suggesting its potential for VDM fine-tuning. In this work, we first propose a straightforward adaptation of REPA for VDMs and empirically show that, while effective for convergence, it is suboptimal in preserving semantic consistency across frames. To address this limitation, we introduce Cross-frame Representation Alignment (CREPA), a novel regularization technique that aligns hidden states of a frame with external features from neighboring frames. Empirical evaluations on large-scale VDMs, including CogVideoX-5B and Hunyuan Video, demonstrate that CREPA improves both visual fidelity and cross-frame semantic coherence when fine-tuned with parameter-efficient methods such as LoRA. We further validate CREPA across diverse datasets with varying attributes, confirming its broad applicability. Project page: https://crepavideo.github.io
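As an illustrative sketch only (not the authors' implementation, which the paper and project page describe), a CREPA-style cross-frame alignment term could be written as below. The array shapes, the cosine-similarity objective, and the symmetric neighbor window are assumptions; per-frame hidden states and frozen-encoder features are assumed to be already projected to a shared dimension.

```python
import numpy as np

def cosine_sim(a, b, eps=1e-8):
    """Cosine similarity over the last (feature) axis."""
    num = (a * b).sum(-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + eps
    return num / den

def crepa_loss(hidden, feats, window=1):
    """CREPA-style regularizer (sketch, not the paper's exact loss).

    hidden: (T, N, D) diffusion hidden states, one row of N tokens per frame
    feats:  (T, N, D) external pretrained visual features per frame
            (e.g. from a frozen image encoder, projected to dimension D)
    window: number of neighboring frames on each side to align against;
            window=0 recovers a plain per-frame REPA-style alignment.
    """
    T = hidden.shape[0]
    terms = []
    for offset in range(-window, window + 1):
        # Align frame t's hidden states with frame (t + offset)'s features,
        # clipping at the video boundaries.
        src = hidden[max(0, -offset): T - max(0, offset)]
        tgt = feats[max(0, offset): T - max(0, -offset)]
        terms.append((1.0 - cosine_sim(src, tgt)).mean())
    return float(np.mean(terms))
```

During fine-tuning (e.g. with LoRA adapters), such a term would be added to the diffusion training loss with a weighting coefficient; with identical hidden states and features and `window=0`, the loss is zero by construction.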
Problem

Research questions and friction points this paper is trying to address.

Fine-tuning VDMs to reflect specific training data attributes
Improving cross-frame semantic consistency in video generation
Enhancing visual fidelity with parameter-efficient fine-tuning methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapts REPA for Video Diffusion Models
Introduces Cross-frame Representation Alignment (CREPA)
Enhances visual fidelity and semantic coherence