PhysicsMind: Sim and Real Mechanics Benchmarking for Physical Reasoning and Prediction in Foundational VLMs and World Models

📅 2026-01-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of systematic evaluation of physical understanding in current foundation multimodal large language models and video world models, noting that existing benchmarks either rely on synthetic templates or prioritize perceptual quality over physical consistency. To bridge this gap, the authors propose the first unified evaluation benchmark that integrates real-world and simulated data, centered on three fundamental principles of classical mechanics: center of mass, lever equilibrium, and Newton's first law. The benchmark pairs visual question answering with video generation tasks designed to jointly assess a model's capacity for physical reasoning and physically consistent dynamic scene generation. Experimental results reveal that prevailing models rely predominantly on superficial heuristics and frequently violate core mechanical constraints, exposing fundamental limitations of current training paradigms with respect to genuine physical understanding.
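
To make the lever-equilibrium principle concrete, the sketch below shows the kind of ground-truth check a VQA item of this type would need: compare torques about the pivot and report which way the lever tips. This is a minimal illustration, not code from the paper; the function name and example values are hypothetical.

```python
# Minimal sketch (not from the paper): a ground-truth check for a
# lever-equilibrium question. Names and values are hypothetical.

def lever_outcome(left_mass_kg, left_arm_m, right_mass_kg, right_arm_m, tol=1e-9):
    """Compare torques about the pivot and report which way the lever tips."""
    left_torque = left_mass_kg * left_arm_m      # tau = m * g * r; g cancels on both sides
    right_torque = right_mass_kg * right_arm_m
    if abs(left_torque - right_torque) <= tol:
        return "balanced"
    return "tips left" if left_torque > right_torque else "tips right"

# A 2 kg mass at 0.3 m balances a 1.5 kg mass at 0.4 m (0.6 N*m on each side).
print(lever_outcome(2.0, 0.3, 1.5, 0.4))   # -> "balanced"
print(lever_outcome(2.0, 0.3, 1.0, 0.4))   # -> "tips left"
```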

📝 Abstract
Modern foundational Multimodal Large Language Models (MLLMs) and video world models have advanced significantly in mathematical, common-sense, and visual reasoning, but their grasp of the underlying physics remains underexplored. Existing benchmarks that attempt to measure it either rely on synthetic Visual Question Answering templates or focus on perceptual video quality, which is tangential to whether the generated video abides by physical laws. To address this fragmentation, we introduce PhysicsMind, a unified benchmark with both real and simulated environments that evaluates law-consistent reasoning and generation over three canonical principles: Center of Mass, Lever Equilibrium, and Newton's First Law. PhysicsMind comprises two main tasks: i) VQA tasks, testing whether models can reason about and determine physical quantities from images or short videos, and ii) Video Generation (VG) tasks, evaluating whether predicted motion trajectories obey the same center-of-mass, torque, and inertial constraints as the ground truth. A broad range of recent MLLMs and video generation models is evaluated on PhysicsMind and found to rely on appearance heuristics while often violating basic mechanics. These gaps indicate that current scaling and training are still insufficient for robust physical understanding, underscoring PhysicsMind as a focused testbed for physics-aware multimodal models. Our data will be released upon acceptance.
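
As a rough illustration of the VG-style evaluation described above, the sketch below scores how closely a generated clip's center-of-mass trajectory tracks the ground-truth one. The metric, function names, and toy data are assumptions for illustration, not the paper's published evaluation protocol.

```python
# Illustrative sketch only: one way to compare a predicted center-of-mass
# trajectory against ground truth. Not the paper's actual metric.
import numpy as np

def center_of_mass(positions, masses):
    """positions: (T, N, 2) per-frame object centroids; masses: (N,)."""
    weights = masses / masses.sum()
    return np.einsum("tnd,n->td", positions, weights)   # (T, 2) CoM per frame

def com_trajectory_error(pred_positions, gt_positions, masses):
    """Mean per-frame Euclidean distance between predicted and GT CoM paths."""
    pred_com = center_of_mass(pred_positions, masses)
    gt_com = center_of_mass(gt_positions, masses)
    return float(np.linalg.norm(pred_com - gt_com, axis=-1).mean())

# Toy usage: 10 frames, 3 objects, 2D image coordinates.
rng = np.random.default_rng(0)
gt = rng.uniform(0, 1, size=(10, 3, 2))
pred = gt + rng.normal(0, 0.02, size=gt.shape)
print(com_trajectory_error(pred, gt, masses=np.array([1.0, 2.0, 0.5])))
```
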
Problem

Research questions and friction points this paper is trying to address.

physical reasoning
world models
benchmarking
multimodal models
physics understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

physical reasoning
multimodal foundation models
video generation
physics benchmarking
mechanics-aware evaluation