STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?

📅 2025-03-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Current multimodal large language models (MLLMs) have not been systematically evaluated on spatiotemporal quantitative reasoning, a capability critical for embodied intelligence and autonomous driving. Method: We introduce STI-Bench, the first real-world-oriented benchmark for spatiotemporal intelligence, covering desktop, indoor, and outdoor scenarios. It features fine-grained estimation and prediction tasks on object appearance, pose, displacement, and motion. We formally define and quantitatively evaluate core spatiotemporal competencies, including spatial distance estimation and dynamic motion analysis, using multi-source real videos, dense 3D annotations, physics-based simulation, and structured prompt engineering to ensure reproducibility. Contribution/Results: Experiments reveal that state-of-the-art MLLMs achieve only ~42% average accuracy on key tasks, exposing fundamental deficiencies in spatiotemporal quantitative reasoning. STI-Bench establishes a rigorous, scalable foundation for future model development and evaluation in spatiotemporal understanding.

📝 Abstract

The use of Multimodal Large Language Models (MLLMs) as an end-to-end solution for Embodied AI and Autonomous Driving has become a prevailing trend. While MLLMs have been extensively studied for visual semantic understanding tasks, their ability to perform precise and quantitative spatial-temporal understanding in real-world applications remains largely unexamined, leading to uncertain prospects. To assess models' Spatial-Temporal Intelligence, we introduce STI-Bench, a benchmark designed to evaluate MLLMs' spatial-temporal understanding through challenging tasks such as estimating and predicting the appearance, pose, displacement, and motion of objects. Our benchmark encompasses a wide range of robot and vehicle operations across desktop, indoor, and outdoor scenarios. Extensive experiments reveal that state-of-the-art MLLMs still struggle with real-world spatial-temporal understanding, especially in tasks requiring precise distance estimation and motion analysis.

Problem

Research questions and friction points this paper is trying to address.

Evaluating MLLMs' spatial-temporal understanding in real-world applications

Assessing MLLMs' precision in distance estimation and motion analysis

Benchmarking MLLMs' performance in diverse robot and vehicle operations

Innovation

Methods, ideas, or system contributions that make the work stand out.

STI-Bench evaluates MLLMs' precise, quantitative spatial-temporal understanding

Benchmark includes estimation and prediction tasks on object appearance, pose, displacement, and motion

Tests MLLMs across desktop, indoor, and outdoor scenarios

🔎 Similar Papers

Time Awareness in Large Language Models: Benchmarking Fact Recall Across Time