🤖 AI Summary
Existing multimodal large language models (MLLMs) struggle to perceive and reason about 4D spatiotemporal dynamics (3D space evolving over time) from 2D visual inputs. To address this limitation, this work proposes MLLM-4D, a training paradigm tailored to endow MLLMs with 4D spatiotemporal intelligence. By efficiently repurposing stereoscopic video data, the authors construct two instruction datasets, MLLM4D-2M and MLLM4D-R1-30k, and combine supervised fine-tuning (SFT) with reinforcement fine-tuning based on Group Relative Policy Optimization (GRPO). Without modifying the underlying model architecture, they introduce Spatiotemporal Chain-of-Thought (ST-CoT) prompting and a spatiotemporal reward function (ST-reward). Experiments demonstrate that, using only 2D RGB inputs, the proposed approach achieves state-of-the-art performance in 4D spatiotemporal understanding and reasoning on MLLM4D-Bench.
📝 Abstract
Humans are born with vision-based 4D spatiotemporal intelligence, which enables us to perceive and reason about the evolution of 3D space over time from purely visual inputs. Despite its importance, this capability remains a significant bottleneck for current multimodal large language models (MLLMs). To tackle this challenge, we introduce MLLM-4D, a comprehensive framework designed to bridge the gaps in training data curation and model post-training for spatiotemporal understanding and reasoning. On the data front, we develop a cost-efficient data curation pipeline that repurposes existing stereo video datasets into high-quality 4D spatiotemporal instructional data. This yields the MLLM4D-2M and MLLM4D-R1-30k datasets for Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT), alongside MLLM4D-Bench for comprehensive evaluation. Regarding model training, our post-training strategy establishes foundational 4D understanding via SFT and further catalyzes 4D reasoning by employing Group Relative Policy Optimization (GRPO) with specialized Spatiotemporal Chain-of-Thought (ST-CoT) prompting and Spatiotemporal reward functions (ST-reward), without modifying the model architecture. Extensive experiments demonstrate that MLLM-4D achieves state-of-the-art spatiotemporal understanding and reasoning from purely 2D RGB inputs. Project page: https://github.com/GVCLab/MLLM-4D.
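To make the GRPO-based reinforcement fine-tuning step concrete, the sketch below shows the core group-relative advantage computation GRPO is named for, paired with a toy stand-in for the paper's ST-reward. The function names (`st_reward`, `group_advantages`) and the exact-match reward are illustrative assumptions, not the paper's actual reward design.

```python
# Minimal sketch of GRPO's group-relative advantage, assuming a toy
# exact-match reward in place of the paper's ST-reward.
import statistics

def st_reward(answer: str, ground_truth: str) -> float:
    """Toy stand-in for ST-reward: 1.0 for an exact match, else 0.0.
    The real reward would score spatiotemporal correctness more finely."""
    return 1.0 if answer.strip() == ground_truth.strip() else 0.0

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO scores each sampled response relative to its own group:
    A_i = (r_i - mean(r)) / (std(r) + eps), avoiding a learned critic."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    eps = 1e-8  # guards against division by zero when all rewards tie
    return [(r - mean) / (std + eps) for r in rewards]

# A group of 4 sampled answers to one spatiotemporal question:
answers = ["3 meters", "5 meters", "3 meters", "unknown"]
rewards = [st_reward(a, "3 meters") for a in answers]
advantages = group_advantages(rewards)  # correct answers get positive advantage
```

In GRPO these advantages weight the policy-gradient update for each sampled response, so responses that beat their group's average (here, correct spatial answers) are reinforced without training a separate value model.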