STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?

📅 2025-03-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) lack systematic evaluation of spatiotemporal quantitative reasoning capabilities—critical for embodied intelligence and autonomous driving. Method: We introduce STI-Bench, the first real-world-oriented benchmark for spatiotemporal intelligence, covering desktop, indoor, and outdoor scenarios. It features fine-grained estimation and prediction tasks on object appearance, pose, displacement, and motion. We formally define and quantitatively evaluate core spatiotemporal competencies—including spatial distance estimation and dynamic motion analysis—using multi-source real videos, dense 3D annotations, physics-based simulation, and structured prompt engineering to ensure reproducibility. Contribution/Results: Experiments reveal that state-of-the-art MLLMs achieve only ~42% average accuracy on key tasks, exposing fundamental deficiencies in spatiotemporal quantitative reasoning. STI-Bench establishes a rigorous, scalable foundation for future model development and evaluation in spatiotemporal understanding.

📝 Abstract
The use of Multimodal Large Language Models (MLLMs) as an end-to-end solution for Embodied AI and Autonomous Driving has become a prevailing trend. While MLLMs have been extensively studied for visual semantic understanding tasks, their ability to perform precise, quantitative spatial-temporal understanding in real-world applications remains largely unexamined, leaving their prospects uncertain. To assess models' Spatial-Temporal Intelligence, we introduce STI-Bench, a benchmark that evaluates MLLMs' spatial-temporal understanding through challenging tasks such as estimating and predicting the appearance, pose, displacement, and motion of objects. Our benchmark encompasses a wide range of robot and vehicle operations across desktop, indoor, and outdoor scenarios. Extensive experiments reveal that state-of-the-art MLLMs still struggle with real-world spatial-temporal understanding, especially in tasks requiring precise distance estimation and motion analysis.
Problem

Research questions and friction points this paper is trying to address.

Evaluating MLLMs' spatial-temporal understanding in real-world applications
Assessing MLLMs' precision in distance estimation and motion analysis
Benchmarking MLLMs' performance in diverse robot and vehicle operations
Innovation

Methods, ideas, or system contributions that make the work stand out.

STI-Bench evaluates MLLMs' precise, quantitative spatial-temporal understanding
Tasks cover estimating and predicting object appearance, pose, displacement, and motion
Scenarios span desktop, indoor, and outdoor robot and vehicle operations
Yun Li
School of AI, Shanghai Jiao Tong University; China University of Geosciences
Yiming Zhang
School of AI, Shanghai Jiao Tong University; Nanyang Technological University
Tao Lin
School of AI, Shanghai Jiao Tong University
XiangRui Liu
School of AI, Shanghai Jiao Tong University; BAAI
Wenxiao Cai
Stanford University
Zheng Liu
BAAI
Bo Zhao
School of AI, Shanghai Jiao Tong University