🤖 AI Summary
Current vision-language models excel at static image understanding but remain limited in perceiving and reasoning about spatiotemporal dynamics, object motion, and interactions in dynamic 4D scenes. To address this gap, this work introduces Dyn-Bench, a large-scale evaluation benchmark comprising 1K videos, 7K spatiotemporal question-answer pairs, and 3K dynamic grounding pairs, enabling the first systematic assessment of multimodal large language models' holistic dynamic understanding in 4D environments. The study reveals a performance imbalance between spatiotemporal reasoning and dynamic grounding, motivating a structured fusion strategy, featuring Mask-Guided Fusion and a Spatio-Temporal Textual Cognitive Map (ST-TCM), that significantly enhances models' coherent comprehension of motion and interaction. Extensive experiments on Dyn-Bench validate the effectiveness of the proposed approach.
📝 Abstract
Humans inhabit a physical 4D world where geometric structure and semantic content evolve over time, constituting a dynamic 4D reality (three spatial dimensions plus time). While current Multimodal Large Language Models (MLLMs) excel at static visual understanding, can they also be adept at "thinking in dynamics", i.e., perceiving, tracking, and reasoning about spatio-temporal dynamics in evolving scenes? To systematically assess their spatio-temporal reasoning and localized dynamics perception capabilities, we introduce Dyn-Bench, a large-scale benchmark built from diverse real-world and synthetic video datasets, enabling robust and scalable evaluation of spatio-temporal understanding. Through multi-stage filtering of massive 2D and 4D data sources, Dyn-Bench provides a high-quality collection of dynamic scenes, comprising 1K videos, 7K visual question answering (VQA) pairs, and 3K dynamic object grounding pairs. We probe general, spatial, and region-level MLLMs, eliciting how they think in dynamics both linguistically and visually, and find that existing models cannot simultaneously maintain strong performance in both spatio-temporal reasoning and dynamic object grounding, often producing inconsistent interpretations of motion and interaction. Notably, conventional prompting strategies (e.g., chain-of-thought or caption-based hints) yield limited improvement, whereas structured integration approaches, including Mask-Guided Fusion and a Spatio-Temporal Textual Cognitive Map (ST-TCM), significantly enhance MLLMs' dynamics perception and spatio-temporal reasoning in the physical 4D world. Code and benchmark are available at https://dyn-bench.github.io/.