🤖 AI Summary
Existing physics-based video benchmarks suffer from conceptual entanglement: a single test exercises several physical laws at once, which makes it difficult to attribute a failure to any one of them and to assess how well a world model actually understands physics. This work proposes WorldBench, a conceptually disentangled, video-based evaluation framework for physical reasoning. Its fine-grained benchmark is organized into two levels: intuitive physical concepts such as object permanence and scale/perspective, and low-level physical constants and material properties such as friction coefficients and fluid viscosity. By combining this hierarchical organization of physical concepts with controlled-variable testing of generated videos, the framework diagnoses each physical concept or law independently. Experiments on WorldBench reveal concept-specific patterns of failure in current state-of-the-art world models, all of which lack the physical consistency required to generate reliable real-world interactions. The concept-specific design makes physical reasoning evaluation more nuanced and more scalable.
📝 Abstract
Recent advances in generative foundation models, often termed "world models," have propelled interest in applying them to critical tasks like robotic planning and autonomous system training. For reliable deployment, these models must exhibit high physical fidelity, accurately simulating real-world dynamics. Existing physics-based video benchmarks, however, suffer from entanglement, where a single test simultaneously evaluates multiple physical laws and concepts, fundamentally limiting their diagnostic capability. We introduce WorldBench, a novel video-based benchmark specifically designed for concept-specific, disentangled evaluation, allowing us to rigorously isolate and assess understanding of a single physical concept or law at a time. To make WorldBench comprehensive, we design benchmarks at two different levels: 1) an evaluation of intuitive physical understanding with concepts such as object permanence or scale/perspective, and 2) an evaluation of low-level physical constants and material properties such as friction coefficients or fluid viscosity. When state-of-the-art (SOTA) video-based world models are evaluated on WorldBench, we find specific patterns of failure in particular physics concepts, with all tested models lacking the physical consistency required to generate reliable real-world interactions. Through its concept-specific evaluation, WorldBench offers a more nuanced and scalable framework for rigorously evaluating the physical reasoning capabilities of video generation and world models, paving the way for more robust and generalizable world-model-driven learning.
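To make the idea of disentangled, controlled-variable evaluation concrete, the following is a minimal sketch and not WorldBench's actual implementation. The `Scene` fields, the function names, and the scoring scheme are all hypothetical, chosen only for illustration: each test set varies exactly one physical parameter while holding the others fixed, and scores are aggregated per concept so that a failure can be attributed to a single concept.

```python
from dataclasses import dataclass, replace
from statistics import mean


@dataclass(frozen=True)
class Scene:
    """Hypothetical scene description; the fields are illustrative assumptions."""
    friction: float = 0.5
    viscosity: float = 1.0
    occluder: bool = False


def controlled_variants(base, field, values):
    """Return copies of `base` that differ from it only in `field`."""
    return [replace(base, **{field: value}) for value in values]


def per_concept_scores(results):
    """Average (concept, score) pairs into one mean score per concept."""
    grouped = {}
    for concept, score in results:
        grouped.setdefault(concept, []).append(score)
    return {concept: mean(scores) for concept, scores in grouped.items()}


# Vary friction only; viscosity and occlusion stay at their base values.
friction_tests = controlled_variants(Scene(), "friction", [0.1, 0.9])

# Hypothetical per-video scores, tagged with the single concept each one probes.
report = per_concept_scores([
    ("friction", 1.0),
    ("friction", 0.0),
    ("object_permanence", 1.0),
])
```

Because every test case is tagged with exactly one concept, a low entry in `report` points at that concept alone, which is the diagnostic property that entangled benchmarks lack.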