Beyond Static Vision: Scene Dynamic Field Unlocks Intuitive Physics Understanding in Multi-modal Large Language Models

📅 2026-03-30

🤖 AI Summary

This work addresses the limited ability of multimodal large language models to intuitively understand the dynamics of continuous media such as fluids, a weakness that shows up in high-level physical reasoning tasks. Focusing specifically on continuum dynamics, a domain previously unexplored in this context, the study proposes a lightweight and efficient Scene Dynamic Field (SDF) mechanism that injects dynamic physical priors into the model by combining physics simulators with multi-task fine-tuning. To evaluate model performance systematically, it introduces two new benchmark tasks: Next Frame Selection and Temporal Coherence Verification. Experiments show that the approach achieves a performance gain of up to 20.7% on fluid-related tasks and generalizes well to unseen physical scenarios.

📝 Abstract

While Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in image and video understanding, their ability to comprehend the physical world has become an increasingly important research focus. Despite recent improvements, current MLLMs struggle significantly with high-level physics reasoning. In this work, we investigate the first step of physical reasoning, i.e., intuitive physics understanding, and reveal substantial limitations in understanding the dynamics of continuum objects. To isolate and evaluate this specific capability, we introduce two fundamental benchmark tasks: Next Frame Selection (NFS) and Temporal Coherence Verification (TCV). Our experiments demonstrate that even state-of-the-art MLLMs perform poorly on these foundational tasks. To address this limitation, we propose Scene Dynamic Field (SDF), a concise approach that leverages physics simulators within a multi-task fine-tuning framework. SDF substantially improves performance, achieving gains of up to 20.7% on fluid tasks while showing strong generalization to unseen physical domains. This work not only highlights a critical gap in current MLLMs but also presents a promising, cost-efficient approach for developing more physically grounded MLLMs. Our code and data are available at https://github.com/andylinx/Scene-Dynamic-Field.
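
The abstract names the two benchmark tasks but does not specify their item format. As a rough illustration only, the sketch below shows one plausible way such items could be built from a simulated frame sequence and scored by accuracy. Everything in it is an assumption made for illustration and is not taken from the paper or its repository: the `NFSItem` and `TCVItem` types, the distractor sampling, the shuffling used to create incoherent clips, and the helper names are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class NFSItem:
    """Hypothetical Next Frame Selection item (format assumed, not from the paper)."""
    context: List[int]     # indices of the frames shown to the model
    candidates: List[int]  # candidate indices for the next frame
    answer: int            # position of the true next frame in `candidates`


@dataclass
class TCVItem:
    """Hypothetical Temporal Coherence Verification item (format assumed)."""
    frames: List[int]      # frame indices in the order shown to the model
    coherent: bool         # True if the frames keep their original order


def make_nfs_item(num_frames: int, context_len: int,
                  num_candidates: int, rng: random.Random) -> NFSItem:
    # Assumption: show `context_len` consecutive frames, then ask which
    # candidate frame comes next; distractors are other frames of the clip.
    if num_frames - context_len - 1 < num_candidates - 1:
        raise ValueError("not enough frames to sample distractors")
    start = rng.randrange(0, num_frames - context_len)
    context = list(range(start, start + context_len))
    true_next = start + context_len
    pool = [i for i in range(num_frames)
            if i != true_next and i not in context]
    candidates = rng.sample(pool, num_candidates - 1) + [true_next]
    rng.shuffle(candidates)
    return NFSItem(context, candidates, candidates.index(true_next))


def make_tcv_item(num_frames: int, clip_len: int,
                  rng: random.Random) -> TCVItem:
    # Assumption: a clip is either left in order (coherent) or shuffled
    # (incoherent), and the model must say which.
    if clip_len < 2 or clip_len > num_frames:
        raise ValueError("clip_len must be between 2 and num_frames")
    start = rng.randrange(0, num_frames - clip_len + 1)
    frames = list(range(start, start + clip_len))
    coherent = rng.random() < 0.5
    if not coherent:
        shuffled = frames[:]
        while shuffled == frames:  # make sure the order really changes
            rng.shuffle(shuffled)
        frames = shuffled
    return TCVItem(frames, coherent)


def accuracy(predictions: Sequence, targets: Sequence) -> float:
    # Fraction of predictions that match the targets.
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets differ in length")
    if not targets:
        return 0.0
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)
```

In this sketch, frames are referred to by index so the example stays independent of any image library; a real pipeline would map the indices to rendered simulator frames before prompting the model.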

Problem

Research questions and friction points this paper is trying to address.

intuitive physics understanding

multimodal large language models

physical reasoning

continuum object dynamics

Innovation

Methods, ideas, or system contributions that make the work stand out.

Scene Dynamic Field

intuitive physics understanding

multimodal large language models

physics simulation