Morpheus: Benchmarking Physical Reasoning of Video Generative Models with Real Physical Experiments

📅 2025-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current video generation models lack physical consistency, which hinders the trustworthy development of embodied AI. Method: The authors introduce Morpheus, the first benchmark grounded in real-world physics experiments, comprising 80 empirically captured videos of diverse physical phenomena. They propose a no-ground-truth evaluation paradigm anchored in physical conservation laws (e.g., energy and momentum) and design a differentiable physical plausibility metric that combines physics-informed neural networks (PINNs) with vision-language models. Contribution/Results: Experiments reveal that state-of-the-art models, while visually compelling, exhibit substantial physical inconsistencies. All benchmark data, implementation code, and a live leaderboard are open-sourced, establishing a rigorous, publicly accessible standard for evaluating the world-modeling capabilities of generative video systems.

📝 Abstract
Recent advances in image and video generation raise hopes that these models possess world modeling capabilities, the ability to generate realistic, physically plausible videos. This could revolutionize applications in robotics, autonomous driving, and scientific simulation. However, before treating these models as world models, we must ask: Do they adhere to physical conservation laws? To answer this, we introduce Morpheus, a benchmark for evaluating video generation models on physical reasoning. It features 80 real-world videos capturing physical phenomena, guided by conservation laws. Since artificial generations lack ground truth, we assess physical plausibility using physics-informed metrics evaluated with respect to infallible conservation laws known per physical setting, leveraging advances in physics-informed neural networks and vision-language foundation models. Our findings reveal that even with advanced prompting and video conditioning, current models struggle to encode physical principles despite generating aesthetically pleasing videos. All data, leaderboard, and code are open-sourced at our project page.
Problem

Research questions and friction points this paper is trying to address.

Assessing video generative models' adherence to physical laws
Evaluating physical plausibility using physics-informed metrics
Benchmarking models' ability to generate realistic physical phenomena
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark with real-world physics videos
Physics-informed metrics for evaluation
Leverage vision-language foundation models
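To make the evaluation idea above concrete, here is a minimal toy sketch of a conservation-law-based plausibility score. It is not the paper's actual metric (Morpheus uses PINNs and vision-language models); it only assumes that per-frame object heights can be extracted from a generated free-fall video, and scores how well total mechanical energy is conserved across frames. The function name and scoring formula are illustrative choices.

```python
import numpy as np

def energy_conservation_score(y, dt, m=1.0, g=9.81):
    """Toy physical-plausibility score for a tracked falling object.

    y  : per-frame heights (metres) extracted from a generated video
    dt : time between frames (seconds)

    For free fall, total mechanical energy E = m*g*y + 0.5*m*v^2 should
    stay constant. We measure the relative drift of E across frames and
    map it to (0, 1]: no drift -> 1 (plausible), large drift -> near 0.
    """
    y = np.asarray(y, dtype=float)
    v = np.gradient(y, dt)               # finite-difference velocity
    E = m * g * y + 0.5 * m * v**2       # per-frame mechanical energy
    drift = np.std(E) / (np.mean(np.abs(E)) + 1e-9)
    return float(np.exp(-drift))

# Physically consistent free fall from 10 m: y(t) = 10 - 0.5*g*t^2
t = np.arange(0.0, 1.0, 0.01)
physical = 10.0 - 0.5 * 9.81 * t**2
score_good = energy_conservation_score(physical, dt=0.01)

# Unphysical motion (sqrt-shaped descent) violates energy conservation
unphysical = 10.0 - 5.0 * np.sqrt(t + 1e-9)
score_bad = energy_conservation_score(unphysical, dt=0.01)
```

A real metric must also handle the perception step (tracking objects in pixels) and settings where the conserved quantity differs per experiment, which is where the paper's PINN and vision-language components come in.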
👥 Authors
- Chenyu Zhang (University of Trento, Italy)
- Daniil Cherniavskii (University of Amsterdam, the Netherlands)
- Andrii Zadaianchuk (University of Amsterdam, the Netherlands)
- Antonios Tragoudaras (University of Amsterdam, the Netherlands)
- Antonios Vozikis (University of Amsterdam, the Netherlands)
- Thijmen Nijdam (University of Amsterdam, the Netherlands)
- Derck W. E. Prinzhorn (University of Amsterdam, the Netherlands)
- Mark Bodracska (University of Amsterdam, the Netherlands)
- N. Sebe (University of Trento, Italy)
- E. Gavves (University of Amsterdam, the Netherlands; Archimedes, Athena Research Center, Greece)