🤖 AI Summary
Despite significant advances in visual quality, state-of-the-art text-to-video (T2V) models routinely violate fundamental physical laws—such as Newtonian mechanics and energy conservation—generating physically implausible content.
Method: We introduce the first benchmark for physics consistency grounded in first principles, systematically evaluating compliance with 12 core physical laws. Our methodology integrates physics-aware simulation validation, structured prompt engineering, a three-stage human expert evaluation, and quantitative compliance metrics. We further propose prompt-hint ablation studies and counterfactual robustness testing to diagnose failure modes.
Contribution/Results: Experiments across leading open-source and commercial T2V models reveal average compliance rates below 0.60 across all physical laws. Detailed physics-informed prompts yield negligible improvement, and models remain highly susceptible to adversarial prompting that elicits explicit violations. This work establishes a novel paradigm for assessing trustworthiness and achieving physics alignment in T2V generation.
📝 Abstract
Text-to-video generative models have made significant strides in recent years, producing high-quality videos that excel in both aesthetic appeal and accurate instruction following, and have become central to digital art creation and user engagement online. Yet, despite these advancements, their ability to respect fundamental physical laws remains largely untested: many outputs still violate basic constraints such as rigid-body collisions, energy conservation, and gravitational dynamics, resulting in unrealistic or even misleading content. Existing physical-evaluation benchmarks typically rely on automatic, pixel-level metrics applied to simplistic, life-scenario prompts, and thus overlook both human judgment and first-principles physics. To fill this gap, we introduce **T2VPhysBench**, a first-principles benchmark that systematically evaluates whether state-of-the-art text-to-video systems, both open-source and commercial, obey twelve core physical laws including Newtonian mechanics, conservation principles, and phenomenological effects. Our benchmark employs a rigorous human evaluation protocol and includes three targeted studies: (1) an overall compliance assessment showing that all models score below 0.60 on average in each law category; (2) a prompt-hint ablation revealing that even detailed, law-specific hints fail to remedy physics violations; and (3) a counterfactual robustness test demonstrating that models often generate videos that explicitly break physical rules when so instructed. The results expose persistent limitations in current architectures and offer concrete insights for guiding future research toward truly physics-aware video generation.
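The per-law compliance rates described above can be illustrated with a minimal sketch. The binary pass/fail rating scale, the majority-vote aggregation, and all function names here are illustrative assumptions, not the benchmark's actual protocol.

```python
# Sketch of a per-law compliance metric, assuming each generated video
# receives binary pass/fail judgments from several human raters.
# The majority-vote rule is an illustrative assumption.
from collections import defaultdict

def compliance_rate(ratings):
    """ratings: list of (law, votes) pairs, one per video, where votes
    is a list of booleans (one per rater). A video counts as compliant
    if a strict majority of raters mark it as passing the law."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for law, votes in ratings:
        total[law] += 1
        if sum(votes) > len(votes) / 2:
            passed[law] += 1
    # Compliance rate per law: fraction of videos judged compliant.
    return {law: passed[law] / total[law] for law in total}

scores = compliance_rate([
    ("gravity", [True, True, False]),   # 2 of 3 raters pass -> compliant
    ("gravity", [False, False, True]),  # 1 of 3 raters pass -> violation
    ("energy_conservation", [True, True, True]),
])
# scores["gravity"] -> 0.5
```

A model's overall score would then be the average of these per-law rates; under this scheme, "below 0.60 in each law category" means fewer than 60% of a model's videos satisfy any single law.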