VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current text-to-video (T2V) diffusion models generate high-fidelity videos but suffer from poor physical plausibility due to insufficient modeling of physical laws. To address this, we propose the first physics-knowledge-injection framework for fine-tuning T2V diffusion models, which distills physical understanding from a video self-supervised foundation model into the diffusion generator to enhance physical consistency. Our core innovation is a Token Relation Distillation (TRD) loss that enables soft spatio-temporal alignment between the teacher and student models. Evaluated on CogVideoX, our method significantly improves performance on physical-commonsense benchmarks, yielding videos that better adhere to fundamental physical principles such as gravity, collision dynamics, and motion continuity. This work establishes a new paradigm for controllable, physics-aware video generation.

📝 Abstract
Recent advancements in text-to-video (T2V) diffusion models have enabled high-fidelity and realistic video synthesis. However, current T2V models often struggle to generate physically plausible content due to their limited inherent ability to accurately understand physics. We found that while the representations within T2V models possess some capacity for physics understanding, they lag significantly behind those from recent video self-supervised learning methods. To this end, we propose a novel framework called VideoREPA, which distills physics understanding capability from video understanding foundation models into T2V models by aligning token-level relations. This closes the physics-understanding gap and enables more physically plausible generation. Specifically, we introduce the Token Relation Distillation (TRD) loss, which leverages spatio-temporal alignment to provide soft guidance suitable for finetuning powerful pre-trained T2V models, a critical departure from prior representation alignment (REPA) methods. To our knowledge, VideoREPA is the first REPA method designed for finetuning T2V models and specifically for injecting physical knowledge. Empirical evaluations show that VideoREPA substantially enhances the physics commonsense of the baseline method, CogVideoX, achieving significant improvement on relevant benchmarks and demonstrating a strong capacity for generating videos consistent with intuitive physics. More video results are available at https://videorepa.github.io/.
Problem

Research questions and friction points this paper is trying to address.

Improving physics plausibility in text-to-video generation
Aligning token relations to distill physics understanding
Enhancing video generation with intuitive physics consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distills physics from foundation models
Uses Token Relation Distillation loss
Aligns token-level relations for physics
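The core idea behind the TRD loss, as described in the abstract, is to align the *relations among* tokens rather than the token features themselves. A minimal sketch of that idea, assuming cosine similarity as the relation measure and an L1 penalty between relation matrices (the paper's exact formulation, token pairing, and weighting are not specified here and may differ):

```python
import numpy as np

def token_relation_matrix(tokens):
    # tokens: (N, D) array of spatio-temporal token features.
    # Normalize rows so pairwise relations are cosine similarities.
    norm = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    return norm @ norm.T  # (N, N) token-to-token relation matrix

def trd_loss(student_tokens, teacher_tokens):
    # Soft guidance: instead of forcing the student's features to match
    # the teacher's (as in feature-level distillation), penalize only the
    # mismatch between the two models' token-relation structures.
    r_student = token_relation_matrix(student_tokens)
    r_teacher = token_relation_matrix(teacher_tokens)
    return np.abs(r_student - r_teacher).mean()
```

Because only relative structure is matched, the student (the T2V diffusion model) is free to keep its own feature space while inheriting the teacher's (a video self-supervised model's) understanding of how patches relate across space and time.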
Xiangdong Zhang
Dept. of CSE & School of AI & MoE Key Lab of AI, Shanghai Jiao Tong University
Jiaqi Liao
Dept. of CSE & School of AI & MoE Key Lab of AI, Shanghai Jiao Tong University
Shaofeng Zhang
Dept. of CSE & School of AI & MoE Key Lab of AI, Shanghai Jiao Tong University
Fanqing Meng
PhD Student, Shanghai Jiao Tong University
Multimodal Learning, Transfer Learning, Large Language Model
Xiangpeng Wan
NetMind.AI
Junchi Yan
FIAPR & ICML Board Member, SJTU (2018-), SII (2024-), AWS (2019-2022), IBM (2011-2018)
Computational Intelligence, AI4Science, Machine Learning, Autonomous Driving
Yu Cheng
The Chinese University of Hong Kong