T2VPhysBench: A First-Principles Benchmark for Physical Consistency in Text-to-Video Generation

📅 2025-05-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Despite significant advances in visual quality, state-of-the-art text-to-video (T2V) models routinely violate fundamental physical laws—such as Newtonian mechanics and energy conservation—generating physically implausible content. Method: We introduce the first first-principles-based benchmark for physics consistency, systematically evaluating compliance with 12 core physical laws. Our methodology integrates physics-aware simulation validation, structured prompt engineering, a three-stage human expert evaluation, and quantitative compliance metrics. We further propose prompt-hint ablation studies and counterfactual robustness testing to diagnose failure modes. Contribution/Results: Experiments across leading open-source and commercial T2V models reveal average compliance rates below 0.60 across all physical laws. Detailed physics-informed prompts yield negligible improvement, and models remain highly susceptible to adversarial prompting that elicits explicit violations. This work establishes a novel paradigm for assessing trustworthiness and achieving physics alignment in T2V generation.
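The summary mentions "quantitative compliance metrics" and per-law compliance rates below 0.60. A minimal sketch of how such per-law rates might be aggregated from binary expert verdicts is shown below; the law names, data, and function are illustrative assumptions, not the paper's actual pipeline.

```python
# Hypothetical sketch: aggregating per-law compliance rates from
# binary expert judgments (1 = law respected, 0 = violated).
# Law names and ratings below are invented for illustration.
from collections import defaultdict

def compliance_rates(judgments):
    """judgments: iterable of (law, verdict) pairs, verdict in {0, 1}.

    Returns a dict mapping each law to its fraction of passing videos.
    """
    totals = defaultdict(int)   # videos rated per law
    passes = defaultdict(int)   # videos judged compliant per law
    for law, verdict in judgments:
        totals[law] += 1
        passes[law] += verdict
    return {law: passes[law] / totals[law] for law in totals}

ratings = [
    ("gravity", 1), ("gravity", 0), ("gravity", 0),
    ("energy_conservation", 1), ("energy_conservation", 0),
]
rates = compliance_rates(ratings)
# gravity -> 1/3, energy_conservation -> 1/2
```

Averaging such rates per law category would yield the kind of sub-0.60 compliance scores the summary reports.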

📝 Abstract
Text-to-video generative models have made significant strides in recent years, producing high-quality videos that excel in both aesthetic appeal and accurate instruction following, and have become central to digital art creation and user engagement online. Yet, despite these advancements, their ability to respect fundamental physical laws remains largely untested: many outputs still violate basic constraints such as rigid-body collisions, energy conservation, and gravitational dynamics, resulting in unrealistic or even misleading content. Existing physical-evaluation benchmarks typically rely on automatic, pixel-level metrics applied to simplistic, life-scenario prompts, and thus overlook both human judgment and first-principles physics. To fill this gap, we introduce T2VPhysBench, a first-principles benchmark that systematically evaluates whether state-of-the-art text-to-video systems, both open-source and commercial, obey twelve core physical laws including Newtonian mechanics, conservation principles, and phenomenological effects. Our benchmark employs a rigorous human evaluation protocol and includes three targeted studies: (1) an overall compliance assessment showing that all models score below 0.60 on average in each law category; (2) a prompt-hint ablation revealing that even detailed, law-specific hints fail to remedy physics violations; and (3) a counterfactual robustness test demonstrating that models often generate videos that explicitly break physical rules when so instructed. The results expose persistent limitations in current architectures and offer concrete insights for guiding future research toward truly physics-aware video generation.
Problem

Research questions and friction points this paper is trying to address.

Evaluates text-to-video models' adherence to physical laws
Identifies violations in rigid-body collisions and energy conservation
Assesses model robustness with counterfactual physical instructions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces T2VPhysBench for physics evaluation
Uses human judgment and first-principles physics
Tests twelve core physical laws systematically
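The prompt-hint ablation described above pairs each base prompt with a law-specific hinted variant and compares compliance across the two conditions. A minimal sketch of that pairing, with invented prompts and hints (not the benchmark's actual prompt set):

```python
# Illustrative sketch of a prompt-hint ablation setup.
# Base prompts and law-specific hints are hypothetical examples.
BASE_PROMPTS = {
    "gravity": "A ball is dropped from a rooftop.",
    "momentum": "Two billiard balls collide on a table.",
}
HINTS = {
    "gravity": "The ball accelerates downward under gravity.",
    "momentum": "Total momentum is conserved in the collision.",
}

def build_conditions(law):
    """Return the base prompt and its hint-augmented variant for one law."""
    base = BASE_PROMPTS[law]
    return {"base": base, "hinted": f"{base} {HINTS[law]}"}

conds = build_conditions("gravity")
# Generate videos under both conditions, collect expert verdicts,
# then compare per-condition compliance rates.
```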
Xuyang Guo
Guilin University of Electronic Technology
Machine Learning
Jiayan Huo
University of Arizona
Zhenmei Shi
Senior Research Scientist at MongoDB + Voyage AI; PhD from University of Wisconsin–Madison
Deep Learning, Machine Learning, Artificial Intelligence
Zhao Song
The Simons Institute for the Theory of Computing, UC Berkeley
Jiahao Zhang
Jiale Zhao
Arizona State University