Cross-Frame Representation Alignment for Fine-Tuning Video Diffusion Models

📅 2025-06-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
User-level fine-tuning of video diffusion models (VDMs) often degrades temporal coherence because semantic consistency across frames is hard to preserve, a problem that is especially pronounced when fine-tuning on fine-grained attributes. To address this, the paper proposes Cross-frame Representation Alignment (CREPA), a regularization that aligns the hidden states of each frame with external pretrained visual features drawn from neighboring frames. CREPA extends the Representation Alignment (REPA) framework from image diffusion to video by enforcing cross-frame feature alignment, and it is compatible with parameter-efficient adaptation methods such as LoRA. Experiments on CogVideoX-5B and Hunyuan Video show that CREPA improves both visual fidelity and inter-frame semantic consistency under fine-grained fine-tuning, and that it generalizes across datasets with diverse attributes without architectural modifications or additional inference overhead.

📝 Abstract
Fine-tuning Video Diffusion Models (VDMs) at the user level to generate videos that reflect specific attributes of training data presents notable challenges, yet remains underexplored despite its practical importance. Meanwhile, recent work such as Representation Alignment (REPA) has shown promise in improving the convergence and quality of DiT-based image diffusion models by aligning, or assimilating, their internal hidden states with external pretrained visual features, suggesting its potential for VDM fine-tuning. In this work, we first propose a straightforward adaptation of REPA for VDMs and empirically show that, while effective for convergence, it is suboptimal in preserving semantic consistency across frames. To address this limitation, we introduce Cross-frame Representation Alignment (CREPA), a novel regularization technique that aligns hidden states of a frame with external features from neighboring frames. Empirical evaluations on large-scale VDMs, including CogVideoX-5B and Hunyuan Video, demonstrate that CREPA improves both visual fidelity and cross-frame semantic coherence when fine-tuned with parameter-efficient methods such as LoRA. We further validate CREPA across diverse datasets with varying attributes, confirming its broad applicability. Project page: https://crepavideo.github.io
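As an illustrative sketch only (not the authors' implementation, which the paper and project page describe), a CREPA-style cross-frame alignment term could be written as below. The array shapes, the cosine-similarity objective, and the symmetric neighbor window are assumptions; per-frame hidden states and frozen-encoder features are assumed to be already projected to a shared dimension.

```python
import numpy as np

def cosine_sim(a, b, eps=1e-8):
    """Cosine similarity over the last (feature) axis."""
    num = (a * b).sum(-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + eps
    return num / den

def crepa_loss(hidden, feats, window=1):
    """CREPA-style regularizer (sketch, not the paper's exact loss).

    hidden: (T, N, D) diffusion hidden states, one row of N tokens per frame
    feats:  (T, N, D) external pretrained visual features per frame
            (e.g. from a frozen image encoder, projected to dimension D)
    window: number of neighboring frames on each side to align against;
            window=0 recovers a plain per-frame REPA-style alignment.
    """
    T = hidden.shape[0]
    terms = []
    for offset in range(-window, window + 1):
        # Align frame t's hidden states with frame (t + offset)'s features,
        # clipping at the video boundaries.
        src = hidden[max(0, -offset): T - max(0, offset)]
        tgt = feats[max(0, offset): T - max(0, -offset)]
        terms.append((1.0 - cosine_sim(src, tgt)).mean())
    return float(np.mean(terms))
```

During fine-tuning (e.g. with LoRA adapters), such a term would be added to the diffusion training loss with a weighting coefficient; with identical hidden states and features and `window=0`, the loss is zero by construction.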
Problem

Research questions and friction points this paper is trying to address.

Fine-tuning VDMs to reflect specific training data attributes
Improving cross-frame semantic consistency in video generation
Enhancing visual fidelity with parameter-efficient fine-tuning methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapts REPA for Video Diffusion Models
Introduces Cross-frame Representation Alignment (CREPA)
Enhances visual fidelity and semantic coherence