PhyGround: Benchmarking Physical Reasoning in Generative World Models

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of fine-grained, auditable evaluation methods for physical reasoning in current video generation models, a gap that makes it hard to diagnose failures with respect to specific physical laws. The authors propose PhyGround, a criteria-grounded benchmark for video generation that covers a taxonomy of 13 physical laws across solid-body mechanics, fluid dynamics, and optics, and comprises 250 curated prompts, each paired with an expected physical outcome. Using a human study modeled on social science lab experiment design, they evaluate eight video generation models and collect 5,796 complete annotations from 459 annotators, yielding over 37.4K fine-grained labels; after quality control, the retained annotations show high split-half model-ranking correlations (Spearman's rho > 0.90). Built on this data, they release PhyJudge-9B, an open physics-specialized vision-language model judge for reproducible automated evaluation, which achieves substantially lower aggregate relative bias than Gemini-3.1-Pro (3.3% vs. 16.6%).

📝 Abstract

Generative world models are increasingly used for video generation, where learned simulators are expected to capture the physical rules that govern real-world dynamics. However, evaluating whether generated videos actually follow these rules remains challenging. Existing physics-focused video benchmarks have made important progress, but they still face three key challenges: coarse evaluation frameworks that hide law-specific failures, response biases and annotator fatigue that undermine the validity of annotation judgments, and automated evaluators that are insufficiently physics-aware or difficult to audit. To address these challenges, we introduce PhyGround, a criteria-grounded benchmark for evaluating physical reasoning in video generation. The benchmark contains 250 curated prompts, each augmented with an expected physical outcome, and a taxonomy of 13 physical laws across solid-body mechanics, fluid dynamics, and optics. Each law is operationalized through observable sub-questions to enable per-law diagnostics. We evaluate eight modern video generation models through a large-scale, quality-controlled human study grounded in social science lab experiment design. A total of 459 annotators provided 5,796 complete annotations and over 37.4K fine-grained labels; after quality control, the retained annotations exhibited high split-half model-ranking correlations (Spearman's rho > 0.90). To support reproducible automated evaluation, we release PhyJudge-9B, an open physics-specialized VLM judge. PhyJudge-9B achieves substantially lower aggregate relative bias than Gemini-3.1-Pro (3.3% vs. 16.6%). We release prompts, human annotations, model checkpoints, and evaluation code on the project page https://phyground.github.io/.
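
The split-half model-ranking correlation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' released evaluation code: the function names, the input layout (`scores_by_model`, a mapping from model name to a list of per-annotation scores), and the choice to split each model's annotations at random are all assumptions made for illustration. The paper may split by annotator or by prompt instead.

```python
import random
from statistics import mean


def rank(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(rank(x), rank(y))


def split_half_rho(scores_by_model, seed=0):
    """Hypothetical helper: split each model's annotations into two random
    halves, average each half, and correlate the two resulting model rankings.
    Each model needs at least two scores."""
    rng = random.Random(seed)
    half_a, half_b = [], []
    for model in sorted(scores_by_model):
        scores = list(scores_by_model[model])
        rng.shuffle(scores)
        mid = len(scores) // 2
        half_a.append(mean(scores[:mid]))
        half_b.append(mean(scores[mid:]))
    return spearman(half_a, half_b)
```

A value near 1 means both halves of the annotation pool rank the models in nearly the same order, which is the sense in which a threshold such as rho > 0.90 indicates reliable rankings. In practice the split would typically be repeated over many seeds and the correlations averaged.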

Problem

Research questions and friction points this paper is trying to address.

physical reasoning

video generation

generative world models

physics evaluation

benchmarking

Innovation

Methods, ideas, or system contributions that make the work stand out.

generative world models

physical reasoning benchmark

physics-aware evaluation