🤖 AI Summary
Problem: Current multimodal large language models (MLLMs) lack systematic evaluation of their spatiotemporal quantitative reasoning capabilities, which are critical for embodied intelligence and autonomous driving.
Method: We introduce STI-Bench, the first real-world-oriented benchmark for spatiotemporal intelligence, covering desktop, indoor, and outdoor scenarios. It features fine-grained estimation and prediction tasks on object appearance, pose, displacement, and motion. We formally define and quantitatively evaluate core spatiotemporal competencies, including spatial distance estimation and dynamic motion analysis. The benchmark draws on multi-source real videos, dense 3D annotations, physics-based simulation, and structured prompt engineering to ensure reproducibility.
Contribution/Results: Experiments reveal that state-of-the-art MLLMs achieve only ~42% average accuracy on key tasks, exposing fundamental deficiencies in spatiotemporal quantitative reasoning. STI-Bench establishes a rigorous, scalable foundation for future model development and evaluation in spatiotemporal understanding.
📝 Abstract
The use of Multimodal Large Language Models (MLLMs) as an end-to-end solution for Embodied AI and Autonomous Driving has become a prevailing trend. While MLLMs have been extensively studied for visual semantic understanding tasks, their ability to perform precise, quantitative spatial-temporal understanding in real-world applications remains largely unexamined, leaving their prospects uncertain. To assess models' Spatial-Temporal Intelligence, we introduce STI-Bench, a benchmark that evaluates MLLMs' spatial-temporal understanding through challenging tasks such as estimating and predicting the appearance, pose, displacement, and motion of objects. Our benchmark encompasses a wide range of robot and vehicle operations across desktop, indoor, and outdoor scenarios. Extensive experiments reveal that state-of-the-art MLLMs still struggle with real-world spatial-temporal understanding, especially in tasks requiring precise distance estimation and motion analysis.
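To make the notion of quantitative spatial-temporal understanding concrete, the sketch below computes two of the quantities named above, displacement and average speed, from 3D object positions at two timestamps. It is only an illustration: the function names, the coordinate values, and the assumption that positions are object centers in meters are hypothetical and are not taken from STI-Bench.

```python
import math


def displacement(p_start, p_end):
    """Euclidean distance (meters) between two 3D points given as (x, y, z)."""
    return math.dist(p_start, p_end)


def average_speed(p_start, p_end, t_start, t_end):
    """Average speed (m/s) between two timestamps given in seconds."""
    if t_end <= t_start:
        raise ValueError("t_end must be greater than t_start")
    return displacement(p_start, p_end) / (t_end - t_start)


# Hypothetical object centers (meters) at t = 0 s and t = 2 s.
start, end = (0.0, 0.0, 0.0), (3.0, 4.0, 0.0)
print(displacement(start, end))             # 5.0
print(average_speed(start, end, 0.0, 2.0))  # 2.5
```

A model with precise spatial-temporal understanding would be expected to estimate values like these directly from video, without access to the underlying 3D coordinates.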