🤖 AI Summary
This work addresses the lack of fine-grained evaluation of physical plausibility in current video generation models, which hinders identifying the specific causes behind violations of physical laws during dynamic processes. To this end, we introduce a large-scale benchmark grounded in expert human reasoning, featuring fine-grained reasoning trajectories with temporal localization, structured failure categories, and natural-language explanations across 22 physical phenomena. The benchmark integrates real reference videos, expert annotations, and a physics-based taxonomy to form a high-quality, human-evaluated dataset. Experiments reveal that among videos generated by state-of-the-art models in physics-critical scenarios, 83.3% (third-person) and 93.5% (first-person) contain at least one human-identifiable physical inconsistency, underscoring the urgent need for standardized evaluation protocols and highlighting the diagnostic value of our benchmark.
📝 Abstract
Video generation models are increasingly used as world simulators for storytelling, simulation, and embodied AI. As these models advance, a key question arises: do generated videos obey the physical laws of the real world? Existing evaluations largely rely on automated metrics or coarse human judgments such as preferences or rubric-based checks. While useful for assessing perceptual quality, these methods provide limited insight into when and why generated dynamics violate real-world physical constraints. We introduce Physion-Eval, a large-scale benchmark of expert human reasoning for diagnosing physical-realism failures in videos generated by five state-of-the-art models across egocentric and exocentric views, containing 10,990 expert reasoning traces spanning 22 fine-grained physical categories. Each generated video is derived from a corresponding real-world reference video depicting a clear physical process and is annotated with temporally localized glitches, structured failure categories, and natural-language explanations of the violated physical behavior. Using this dataset, we reveal a striking limitation of current video generation models: in physics-critical scenarios, 83.3% of exocentric and 93.5% of egocentric generated videos exhibit at least one human-identifiable physical glitch. We hope Physion-Eval will set a new standard for physical-realism evaluation and guide the development of physics-grounded video generation. The benchmark is publicly available at https://huggingface.co/datasets/PhysionLabs/Physion-Eval.