🤖 AI Summary
Existing multimodal large language models (MLLMs) struggle to perceive and reason about 4D spatiotemporal dynamics (3D space evolving over time) from 2D visual inputs. To address this limitation, this work proposes MLLM-4D, a training paradigm tailored to endow MLLMs with 4D spatiotemporal intelligence. By efficiently repurposing stereoscopic video data, the authors construct two instruction datasets, MLLM4D-2M and MLLM4D-R1-30k, and combine supervised fine-tuning (SFT) with reinforcement fine-tuning based on Group Relative Policy Optimization (GRPO). Without modifying the underlying model architecture, they introduce Spatiotemporal Chain-of-Thought (ST-CoT) prompting and a spatiotemporal reward function (ST-reward). Experiments demonstrate that, using only 2D RGB inputs, the proposed approach achieves state-of-the-art performance in 4D spatiotemporal understanding and reasoning on MLLM4D-Bench.
📝 Abstract
Humans are born with vision-based 4D spatiotemporal intelligence, which enables us to perceive and reason about the evolution of 3D space over time from purely visual inputs. Despite its importance, this capability remains a significant bottleneck for current multimodal large language models (MLLMs). To tackle this challenge, we introduce MLLM-4D, a comprehensive framework designed to bridge the gaps in training data curation and model post-training for spatiotemporal understanding and reasoning. On the data front, we develop a cost-efficient data curation pipeline that repurposes existing stereo video datasets into high-quality 4D spatiotemporal instructional data. This yields the MLLM4D-2M and MLLM4D-R1-30k datasets for Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT), alongside MLLM4D-Bench for comprehensive evaluation. Regarding model training, our post-training strategy establishes foundational 4D understanding via SFT and further catalyzes 4D reasoning by employing Group Relative Policy Optimization (GRPO) with specialized Spatiotemporal Chain-of-Thought (ST-CoT) prompting and Spatiotemporal reward functions (ST-reward), without modifying the model architecture. Extensive experiments demonstrate that MLLM-4D achieves state-of-the-art spatiotemporal understanding and reasoning from purely 2D RGB inputs. Project page: https://github.com/GVCLab/MLLM-4D.
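To make the GRPO-based reinforcement fine-tuning step concrete, the sketch below shows the core group-relative advantage computation GRPO is named for, paired with a toy stand-in for the paper's ST-reward. The function names (`st_reward`, `group_advantages`) and the exact-match reward are illustrative assumptions, not the paper's actual reward design.

```python
# Minimal sketch of GRPO's group-relative advantage, assuming a toy
# exact-match reward in place of the paper's ST-reward.
import statistics

def st_reward(answer: str, ground_truth: str) -> float:
    """Toy stand-in for ST-reward: 1.0 for an exact match, else 0.0.
    The real reward would score spatiotemporal correctness more finely."""
    return 1.0 if answer.strip() == ground_truth.strip() else 0.0

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO scores each sampled response relative to its own group:
    A_i = (r_i - mean(r)) / (std(r) + eps), avoiding a learned critic."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    eps = 1e-8  # guards against division by zero when all rewards tie
    return [(r - mean) / (std + eps) for r in rewards]

# A group of 4 sampled answers to one spatiotemporal question:
answers = ["3 meters", "5 meters", "3 meters", "unknown"]
rewards = [st_reward(a, "3 meters") for a in answers]
advantages = group_advantages(rewards)  # correct answers get positive advantage
```

In GRPO these advantages weight the policy-gradient update for each sampled response, so responses that beat their group's average (here, correct spatial answers) are reinforced without training a separate value model.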