VideoPhy-2: A Challenging Action-Centric Physical Commonsense Evaluation in Video Generation

📅 2025-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of physical commonsense in large-scale video generation models when they synthesize realistic human actions. To this end, the authors introduce VideoPhy-2, an action-centric benchmark for evaluating physical commonsense in generated videos. It covers 200 diverse real-world actions with detailed prompts and enables joint assessment of semantic adherence, physical plausibility, and fine-grained grounding of physical rules (e.g., conservation of mass and momentum). Evaluation combines large-scale human annotation with VideoPhy-AutoEval, an automatic evaluator trained for fast, reliable assessment on the dataset. Experiments reveal that even the best state-of-the-art video generation model achieves only 22% joint semantic and physical correctness on the hard subset of VideoPhy-2, with conservation laws proving especially difficult. The benchmark data and code are publicly released, establishing a foundation for developing physically grounded video generation systems.

📝 Abstract
Large-scale video generative models, capable of creating realistic videos of diverse visual concepts, are strong candidates for general-purpose physical world simulators. However, their adherence to physical commonsense across real-world actions (e.g., playing tennis, performing a backflip) remains unclear. Existing benchmarks suffer from limitations such as small scale, lack of human evaluation, sim-to-real gaps, and the absence of fine-grained physical rule analysis. To address this, we introduce VideoPhy-2, an action-centric dataset for evaluating physical commonsense in generated videos. We curate 200 diverse actions and detailed prompts for video synthesis from modern generative models. We perform human evaluation that assesses semantic adherence, physical commonsense, and grounding of physical rules in the generated videos. Our findings reveal major shortcomings, with even the best model achieving only 22% joint performance (i.e., high semantic and physical commonsense adherence) on the hard subset of VideoPhy-2. We find that the models particularly struggle with conservation laws such as mass and momentum. Finally, we also train VideoPhy-AutoEval, an automatic evaluator for fast, reliable assessment on our dataset. Overall, VideoPhy-2 serves as a rigorous benchmark, exposing critical gaps in video generative models and guiding future research in physically grounded video generation. The data and code are available at https://videophy2.github.io/.
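The abstract's headline number is a joint metric: a generated video counts as correct only when it scores high on both semantic adherence and physical commonsense. Below is a minimal sketch of how such a joint score could be aggregated from per-video ratings; the field names, 5-point scale, and threshold are illustrative assumptions rather than the paper's exact annotation schema.

```python
# Hedged sketch: aggregate a joint semantic + physical score.
# The keys "semantic_adherence" and "physical_commonsense", the assumed
# 1-5 rating scale, and the threshold of 4 are illustrative assumptions,
# not the benchmark's actual annotation format.

def joint_performance(ratings, threshold=4):
    """Fraction of videos rated high on BOTH semantic adherence and
    physical commonsense (each on an assumed 1-5 scale)."""
    if not ratings:
        return 0.0
    joint_hits = sum(
        1
        for r in ratings
        if r["semantic_adherence"] >= threshold
        and r["physical_commonsense"] >= threshold
    )
    return joint_hits / len(ratings)

# Example: three annotated videos; only the first passes both criteria.
ratings = [
    {"semantic_adherence": 5, "physical_commonsense": 4},
    {"semantic_adherence": 5, "physical_commonsense": 2},
    {"semantic_adherence": 3, "physical_commonsense": 5},
]
print(joint_performance(ratings))  # ~0.333
```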
Problem

Research questions and friction points this paper is trying to address.

Do large-scale video generative models obey physical commonsense (e.g., conservation of mass and momentum) when synthesizing real-world human actions?
Existing benchmarks are small, lack human evaluation, exhibit sim-to-real gaps, and omit fine-grained analysis of individual physical rules (see the sketch after this list).
The field lacks a rigorous, action-centric benchmark to guide research on physically grounded video generation.
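Because the benchmark reports adherence at the level of individual physical rules, a natural companion statistic is each rule's violation rate across generated videos. The sketch below shows one way such per-rule rates could be tallied; the rule names and annotation format are assumptions for illustration, not the benchmark's actual schema.

```python
from collections import defaultdict

# Hedged sketch: tally per-rule violation rates from annotations.
# Each annotation lists the physical rules judged relevant to a video and
# whether each rule was violated. Rule names and the annotation format are
# illustrative assumptions, not the benchmark's actual schema.

def rule_violation_rates(annotations):
    counts = defaultdict(lambda: {"relevant": 0, "violated": 0})
    for ann in annotations:
        for rule, violated in ann["rules"].items():
            counts[rule]["relevant"] += 1
            counts[rule]["violated"] += int(violated)
    return {
        rule: c["violated"] / c["relevant"]
        for rule, c in counts.items()
    }

# Example: two annotated videos with overlapping rule sets.
annotations = [
    {"rules": {"conservation_of_mass": False, "conservation_of_momentum": True}},
    {"rules": {"conservation_of_momentum": True, "gravity": False}},
]
print(rule_violation_rates(annotations))
# {'conservation_of_mass': 0.0, 'conservation_of_momentum': 1.0, 'gravity': 0.0}
```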
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces VideoPhy-2, an action-centric benchmark of 200 real-world actions for physical commonsense evaluation
Human evaluation assesses semantic adherence, physical commonsense, and grounding of physical rules
Trains VideoPhy-AutoEval, an automatic evaluator for fast, reliable assessment on the dataset