Physion-Eval: Evaluating Physical Realism in Generated Video via Human Reasoning

📅 2026-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current evaluations of video generation models lack fine-grained assessments of physical plausibility, which makes it difficult to pinpoint the specific causes of physical-law violations in generated dynamics. To address this, we introduce a large-scale benchmark grounded in expert human reasoning, in which each fine-grained reasoning trajectory combines temporal localization, a structured failure category, and a natural-language explanation, spanning 22 physical phenomena. The benchmark integrates real reference videos, expert annotations, and a physics-based taxonomy into a high-quality human-evaluated dataset. Experiments reveal that, among videos generated by state-of-the-art models in physics-critical scenarios, 83.3% of third-person and 93.5% of first-person videos contain at least one human-identifiable physical inconsistency, underscoring the urgent need for standardized evaluation protocols and highlighting the diagnostic value of our benchmark.
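The per-glitch annotation structure described above (temporal localization, a structured failure category, and a natural-language explanation) can be pictured as a small record type. The sketch below is illustrative only: the paper names these three components, but all class and field names here are hypothetical, not a documented schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class GlitchAnnotation:
    """One expert-identified physical inconsistency (hypothetical field names)."""
    start_sec: float       # temporal localization: when the glitch begins
    end_sec: float         # temporal localization: when the glitch ends
    failure_category: str  # one of the 22 fine-grained physical categories
    explanation: str       # natural-language account of the violated physics

@dataclass
class ReasoningTrace:
    """One expert reasoning trajectory over a generated video (hypothetical)."""
    video_id: str          # the generated video being judged
    reference_id: str      # the paired real-world reference video
    view: str              # "first-person" or "third-person"
    glitches: List[GlitchAnnotation] = field(default_factory=list)
```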

📝 Abstract
Video generation models are increasingly used as world simulators for storytelling, simulation, and embodied AI. As these models advance, a key question arises: do generated videos obey the physical laws of the real world? Existing evaluations largely rely on automated metrics or coarse human judgments such as preferences or rubric-based checks. While useful for assessing perceptual quality, these methods provide limited insight into when and why generated dynamics violate real-world physical constraints. We introduce Physion-Eval, a large-scale benchmark of expert human reasoning for diagnosing physical realism failures in videos generated by five state-of-the-art models across egocentric and exocentric views, containing 10,990 expert reasoning traces spanning 22 fine-grained physical categories. Each generated video is derived from a corresponding real-world reference video depicting a clear physical process, and annotated with temporally localized glitches, structured failure categories, and natural-language explanations of the violated physical behavior. Using this dataset, we reveal a striking limitation of current video generation models: in physics-critical scenarios, 83.3% of exocentric and 93.5% of egocentric generated videos exhibit at least one human-identifiable physical glitch. We hope Physion-Eval will set a new standard for physical realism evaluation and guide the development of physics-grounded video generation. The benchmark is publicly available at https://huggingface.co/datasets/PhysionLabs/Physion-Eval.
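Since the benchmark is released on the Hugging Face Hub, it should be loadable with the standard `datasets` library. A minimal sketch follows; the repo id comes from the paper's release URL, but split and column names are not documented here, so inspect the loaded object before relying on any of them.

```python
# pip install datasets
from datasets import load_dataset

# Repo id taken from the paper's release URL (splits/columns are not
# documented in this listing, so treat the schema as unknown until inspected).
ds = load_dataset("PhysionLabs/Physion-Eval")

# Print available splits and features before assuming any field names.
print(ds)
```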
Problem

Research questions and friction points this paper is trying to address.

physical realism
video generation
physics evaluation
human reasoning
generated video
Innovation

Methods, ideas, or system contributions that make the work stand out.

physical realism
video generation
human reasoning
failure diagnosis
benchmark