WorldBench: Disambiguating Physics for Diagnostic Evaluation of World Models

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing physical video benchmarks suffer from conceptual entanglement, making it difficult to accurately assess the physical understanding of world models. This work proposes the first conceptually disentangled evaluation framework for physical reasoning, constructing a fine-grained benchmark that spans dimensions such as object permanence, scale/perspective, friction coefficients, and fluid viscosity. By combining video generation, hierarchical organization of physical concepts, and controlled-variable testing, the framework enables precise, independent diagnosis of individual physical concepts or laws. Experiments on WorldBench reveal systematic deficiencies in current state-of-the-art world models on specific physical principles, showing that they lack the physical consistency required to generate plausible real-world interactions. This approach improves both the diagnostic precision and the scalability of physical-reasoning evaluation.
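The controlled-variable, per-concept diagnosis described above can be sketched as follows. This is a minimal illustration only: the names (`ConceptCase`, `diagnose`), the example concepts, and the scores are hypothetical and not taken from the paper's actual benchmark code. The idea is that each test case perturbs exactly one physical variable, so aggregated scores are attributable to a single concept or law:

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class ConceptCase:
    """One disentangled test: perturbs a single physical variable."""
    concept: str       # e.g. "friction", "object_permanence" (hypothetical labels)
    varied_param: str  # the one variable this case changes; all others held fixed
    score: float       # per-video physical-consistency score in [0, 1]

def diagnose(cases):
    """Aggregate scores per concept, so a low mean points at one specific law."""
    by_concept = {}
    for c in cases:
        by_concept.setdefault(c.concept, []).append(c.score)
    return {concept: mean(scores) for concept, scores in by_concept.items()}

# Illustrative scores for a hypothetical model under evaluation.
cases = [
    ConceptCase("friction", "mu", 0.25),
    ConceptCase("friction", "mu", 0.75),
    ConceptCase("object_permanence", "occluder_width", 0.9),
]
print(diagnose(cases))  # → {'friction': 0.5, 'object_permanence': 0.9}
```

Because each case varies only one parameter, a depressed per-concept mean isolates the failing physical principle rather than blaming an entangled mixture of laws.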

📝 Abstract
Recent advances in generative foundational models, often termed "world models," have propelled interest in applying them to critical tasks like robotic planning and autonomous system training. For reliable deployment, these models must exhibit high physical fidelity, accurately simulating real-world dynamics. Existing physics-based video benchmarks, however, suffer from entanglement, where a single test simultaneously evaluates multiple physical laws and concepts, fundamentally limiting their diagnostic capability. We introduce WorldBench, a novel video-based benchmark specifically designed for concept-specific, disentangled evaluation, allowing us to rigorously isolate and assess understanding of a single physical concept or law at a time. To make WorldBench comprehensive, we design benchmarks at two different levels: 1) an evaluation of intuitive physical understanding with concepts such as object permanence or scale/perspective, and 2) an evaluation of low-level physical constants and material properties such as friction coefficients or fluid viscosity. When SOTA video-based world models are evaluated on WorldBench, we find specific patterns of failure in particular physics concepts, with all tested models lacking the physical consistency required to generate reliable real-world interactions. Through its concept-specific evaluation, WorldBench offers a more nuanced and scalable framework for rigorously evaluating the physical reasoning capabilities of video generation and world models, paving the way for more robust and generalizable world-model-driven learning.
Problem

Research questions and friction points this paper is trying to address.

world models
physics disentanglement
diagnostic evaluation
physical fidelity
video benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

disentangled evaluation
world models
physics benchmark
physical fidelity
concept-specific assessment
Rishi Upadhyay
University of California, Los Angeles
Howard Zhang
PhD Student at UCLA
Computer Vision · Computational Imaging
Jim Solomon
University of California, Los Angeles
Ayush Agrawal
University of California, Los Angeles
Pranay Boreddy
University of California, Los Angeles
Shruti Satya Narayana
University of California, Los Angeles
Yunhao Ba
Sony Research
Computer Vision
Alex Wong
Yale University
Computer Vision · Machine Learning · 3D Vision · Unsupervised Learning · Adversarial Robustness
Celso M de Melo
DEVCOM Army Research Laboratory
Achuta Kadambi
Associate Professor of Electrical Engineering and Computer Science at UCLA
Spatial Intelligence · Physics-based Vision · Computational Imaging · Robotics · Medical Devices