Benchmarking Spatiotemporal Reasoning in LLMs and Reasoning Models: Capabilities and Challenges

📅 2025-05-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limitations of large language models (LLMs) and large reasoning models (LRMs) in complex spatiotemporal signal understanding. To this end, we introduce STARK—the first hierarchical spatiotemporal reasoning benchmark—featuring a three-tier evaluation framework: state estimation, spatiotemporal relational reasoning, and world-knowledge integration. STARK encompasses 26 task categories, 14,552 challenging instances, supports both natural-language responses and executable Python code, and incorporates multimodal sensing data and real-world cyber-physical system (CPS) scenarios. It uniquely unifies assessment of geometric reasoning, multimodal perception, and domain-knowledge integration. Experimental results show that LRMs significantly outperform LLMs on geometry-intensive tasks (e.g., multilateration), achieving >75% average success versus <40% for LLMs; conversely, LLMs match or exceed LRMs on knowledge-intensive tasks (e.g., intent prediction). The o3 model emerges as the current state-of-the-art LRM.

📝 Abstract
Spatiotemporal reasoning plays a key role in Cyber-Physical Systems (CPS). Despite advances in Large Language Models (LLMs) and Large Reasoning Models (LRMs), their capacity to reason about complex spatiotemporal signals remains underexplored. This paper proposes a hierarchical SpatioTemporal reAsoning benchmaRK, STARK, to systematically evaluate LLMs across three levels of reasoning complexity: state estimation (e.g., predicting field variables, localizing and tracking events in space and time), spatiotemporal reasoning over states (e.g., inferring spatial-temporal relationships), and world-knowledge-aware reasoning that integrates contextual and domain knowledge (e.g., intent prediction, landmark-aware navigation). We curate 26 distinct spatiotemporal tasks with diverse sensor modalities, comprising 14,552 challenges in which models answer either directly or through a Python code interpreter. Evaluating 3 LRMs and 8 LLMs, we find that LLMs achieve limited success on tasks requiring geometric reasoning (e.g., multilateration or triangulation), particularly as complexity increases. Surprisingly, LRMs show robust performance across tasks of varying difficulty, often matching or surpassing traditional first-principle-based methods. Our results show that in reasoning tasks requiring world knowledge, the performance gap between LLMs and LRMs narrows, with some LLMs even surpassing LRMs. However, the LRM o3 continues to achieve leading performance across all evaluated tasks, a result attributed primarily to the larger size of the reasoning models. STARK motivates future innovations in model architectures and reasoning paradigms for intelligent CPS by providing a structured framework to identify limitations in the spatiotemporal reasoning of LLMs and LRMs.
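To make the geometry-intensive task family concrete, here is a minimal illustrative sketch (not from the paper) of the kind of multilateration problem the benchmark names: recovering a 2D position from range measurements to known anchors. It assumes noiseless distances and three non-collinear anchors, and linearizes the circle equations by subtracting the first from the others.

```python
import math

def trilaterate(anchors, dists):
    """Estimate a 2D position from three anchor positions and their
    measured ranges. Subtracting the first circle equation from the
    other two cancels the quadratic terms, leaving a 2x2 linear system
    solved here by Cramer's rule."""
    (x1, y1), (x2, y2), (x3, y3) = anchors
    d1, d2, d3 = dists
    # Row for anchor i: 2(xi-x1)x + 2(yi-y1)y = d1^2 - di^2 + xi^2 - x1^2 + yi^2 - y1^2
    a1, b1 = 2 * (x2 - x1), 2 * (y2 - y1)
    c1 = d1**2 - d2**2 + x2**2 - x1**2 + y2**2 - y1**2
    a2, b2 = 2 * (x3 - x1), 2 * (y3 - y1)
    c2 = d1**2 - d3**2 + x3**2 - x1**2 + y3**2 - y1**2
    det = a1 * b2 - a2 * b1  # zero if the anchors are collinear
    return ((c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det)

anchors = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
true_pos = (1.0, 2.0)
dists = [math.dist(a, true_pos) for a in anchors]
x, y = trilaterate(anchors, dists)  # recovers (1.0, 2.0)
```

A closed-form solve like this is the "first-principle" baseline the paper compares against; with noisy ranges or more anchors one would instead fit by least squares.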
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs and LRMs on spatiotemporal reasoning capabilities
Assessing limitations in geometric reasoning as task complexity increases
Comparing LLM and LRM performance gaps on world-knowledge-dependent tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

STARK, a hierarchical spatiotemporal reasoning benchmark with a three-tier evaluation framework
Evaluates models on 26 task categories (14,552 instances) with diverse sensor modalities
Finds LRMs outperform LLMs on complex geometric reasoning tasks