MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models

📅 2025-11-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language models (VLMs) exhibit limited capability in physics-driven reasoning, such as motion dynamics and spatial interactions, which hinders their understanding of real and AI-generated videos and their ability to generate physically consistent content. To address this, we propose MASS, a framework that integrates depth-guided 3D encoding, visual grounding, and object motion tracking to establish a motion-aware spatiotemporal localization mechanism; it further applies reinforcement fine-tuning to inject spatiotemporal signals into the language space, strengthening cross-modal alignment and physical reasoning. We also introduce MASS-Bench, the first large-scale benchmark for physical understanding, covering both real-world and AI-generated videos. Experiments show that MASS achieves substantial gains over comparable and larger-parameter baselines (+8.7% and +6.0% on physics reasoning tasks), matching state-of-the-art closed-source models such as Gemini-2.5-Flash.
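The summary describes the pipeline only at a high level. Below is a minimal sketch of how depth-guided 3D encoding and object motion tracking might feed spatiotemporal tokens into a VLM's language space; every class, method, and shape here (SpatioTemporalGrounder, to_language_space, the 4096-dim language space, etc.) is an illustrative assumption, not the paper's published implementation.

```python
# Hypothetical sketch of a motion-aware spatiotemporal grounding pipeline.
# All class and method names are illustrative assumptions; the paper does
# not publish this interface.
import torch
import torch.nn as nn

class SpatioTemporalGrounder(nn.Module):
    def __init__(self, d_model: int = 4096, d_geo: int = 256):
        super().__init__()
        # Depth-based 3D encoding: lift per-frame depth maps into geometry features.
        self.depth_encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
            nn.Flatten(),
            nn.Linear(32 * 4 * 4, d_geo),
        )
        # Motion tracker summary: encode per-object 3D trajectories (T, 3) -> d_geo.
        self.track_encoder = nn.GRU(input_size=3, hidden_size=d_geo, batch_first=True)
        # Projection into the VLM language space so the fused signals can be
        # consumed as extra tokens alongside ordinary visual tokens.
        self.to_language_space = nn.Linear(2 * d_geo, d_model)

    def forward(self, depth: torch.Tensor, tracks: torch.Tensor) -> torch.Tensor:
        # depth:  (T, 1, H, W) per-frame depth maps
        # tracks: (N, T, 3)    3D positions of N tracked objects over T frames
        geo = self.depth_encoder(depth).mean(dim=0)   # (d_geo,) scene geometry
        _, h = self.track_encoder(tracks)             # h: (1, N, d_geo)
        motion = h.squeeze(0)                         # (N, d_geo) per-object motion
        fused = torch.cat([geo.expand_as(motion), motion], dim=-1)
        return self.to_language_space(fused)          # (N, d_model) spatiotemporal tokens

# Usage: the resulting per-object tokens would be interleaved with the
# VLM's visual tokens before the language decoder.
grounder = SpatioTemporalGrounder()
tokens = grounder(torch.randn(16, 1, 64, 64), torch.randn(5, 16, 3))
print(tokens.shape)  # torch.Size([5, 4096])
```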

📝 Abstract
Vision Language Models (VLMs) perform well on standard video tasks but struggle with physics-driven reasoning involving motion dynamics and spatial interactions. This limitation reduces their ability to interpret real or AI-generated content (AIGC) videos and to generate physically consistent content. We present an approach that addresses this gap by translating physical-world context cues into interpretable representations aligned with VLMs' perception, comprehension, and reasoning. We introduce MASS-Bench, a comprehensive benchmark consisting of 4,350 real-world and AIGC videos and 8,361 free-form video question-answering pairs focused on physics-related comprehension tasks, with detailed annotations including visual detections, sub-segment grounding, and full-sequence 3D motion tracking of entities. We further present MASS, a model-agnostic method that injects spatial-temporal signals into the VLM language space via depth-based 3D encoding and visual grounding, coupled with a motion tracker for object dynamics. To strengthen cross-modal alignment and reasoning, we apply reinforcement fine-tuning. Experiments and ablations show that our refined VLMs outperform comparable and larger baselines, as well as prior state-of-the-art models, by 8.7% and 6.0% respectively, achieving performance comparable to closed-source SoTA VLMs such as Gemini-2.5-Flash on physics reasoning and comprehension. These results validate the effectiveness of our approach.
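The abstract enumerates the annotation types shipped with MASS-Bench (visual detections, sub-segment grounding, full-sequence 3D motion tracks). As a reading aid, here is a minimal sketch of what one benchmark record could look like; the schema, field names, and example values are assumptions for illustration, since the release format is not specified here.

```python
# Hypothetical MASS-Bench record schema; field names and values are
# illustrative assumptions, not the published data format.
from dataclasses import dataclass, field

@dataclass
class EntityTrack:
    entity_id: str
    boxes_2d: list[tuple[int, int, int, int]]       # per-frame (x1, y1, x2, y2) detections
    positions_3d: list[tuple[float, float, float]]  # full-sequence 3D motion track

@dataclass
class MassBenchSample:
    video_path: str                        # real-world or AI-generated clip
    question: str                          # free-form physics QA question
    answer: str                            # free-form reference answer
    grounded_segment: tuple[float, float]  # sub-segment grounding (start_s, end_s)
    entities: list[EntityTrack] = field(default_factory=list)

sample = MassBenchSample(
    video_path="videos/ball_rolling.mp4",
    question="Which direction does the ball accelerate after the collision?",
    answer="It accelerates to the left, away from the impact point.",
    grounded_segment=(2.4, 4.1),
    entities=[EntityTrack("ball_0", [(10, 20, 40, 50)], [(0.1, 0.0, 1.2)])],
)
```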
Problem

Research questions and friction points this paper is trying to address.

VLMs struggle with physics-driven reasoning involving motion dynamics and spatial interactions
This limits their ability to interpret real or AI-generated videos and to generate physically consistent content
The paper addresses this gap by translating physical-world context cues into interpretable VLM representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Injecting spatial-temporal signals into the VLM language space
Using depth-based 3D encoding and visual grounding
Applying reinforcement fine-tuning for cross-modal alignment (see the reward sketch below)
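The last item refers to reinforcement fine-tuning. Below is a minimal sketch of a correctness-plus-format reward such a setup could use for free-form physics QA; the reward terms, weights, and the <think>/<answer> output format are assumptions, not the paper's actual objective.

```python
# Hypothetical reward for reinforcement fine-tuning on physics QA.
# The correctness/format terms and their weights are illustrative
# assumptions; the paper's actual reward design is not specified here.
import re

def physics_qa_reward(response: str, reference_answer: str) -> float:
    reward = 0.0
    # Format term: encourage explicit reasoning before the final answer,
    # e.g. "<think>...</think><answer>...</answer>" style outputs.
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if match:
        reward += 0.2
        prediction = match.group(1).strip().lower()
    else:
        prediction = response.strip().lower()
    # Correctness term: exact or substring match against the reference.
    ref = reference_answer.strip().lower()
    if prediction == ref:
        reward += 1.0
    elif ref in prediction:
        reward += 0.5
    return reward

# Example: a well-formatted, correct answer earns the full reward.
print(physics_qa_reward(
    "<think>Friction decelerates the ball.</think><answer>it slows down</answer>",
    "It slows down",
))  # 1.2
```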