Spatio-Temporal LLM: Reasoning about Environments and Actions

📅 2025-07-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) remain weak at spatio-temporal reasoning tasks that require jointly understanding a static environment and the dynamic actions that recently took place in it, which limits perception-decision consistency for real-world intelligent agents. To address this, the paper proposes the Spatio-Temporal LLM (ST-LLM), which uses a dual-path projector design to model the relationship between static spatial representations and temporally evolving action sequences. ST-LLM is trained on REA ("Reasoning about Environments and Actions"), a newly curated video-text instruction dataset, using separate spatial and temporal pathways coupled with cross-modal alignment. Experiments show that ST-LLM significantly outperforms state-of-the-art MLLMs on the REA benchmark, supporting its stronger capacity for holistic spatio-temporal reasoning. Code and data are publicly released.

📝 Abstract
Despite the significant recent progress of Multimodal Large Language Models (MLLMs), MLLMs still struggle to correctly answer prompts that require a holistic spatio-temporal understanding. Specifically, it is challenging to address prompts that refer to 1) the entirety of an environment that an agent equipped with an MLLM can operate in; and simultaneously also refer to 2) recent actions that just happened and are encoded in a video clip. However, such a holistic spatio-temporal understanding is important for agents operating in the real world. To address this issue, we first develop a framework to collect a large-scale dataset. Using the collected "Reasoning about Environments and Actions" (REA) dataset, we show that recent methods indeed struggle to correctly answer the prompts. To improve, we develop a "spatio-temporal LLM" (ST-LLM), a model equipped with projectors to improve both spatial understanding of an environment and temporal understanding of recent observations. On the collected REA data, we show that the proposed method significantly improves results compared to prior work. Code and data are available at https://zoezheng126.github.io/STLLM-website/.
Problem

Research questions and friction points this paper is trying to address.

MLLMs lack a holistic spatio-temporal understanding of the environments they operate in
Prompts that refer both to an entire environment and to recent actions captured in a video clip are hard to answer correctly
Agents operating in the real world need stronger combined spatial and temporal reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

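ST-LLM improves joint reasoning about environments and recent actions
Projectors improve both spatial understanding of the environment and temporal understanding of recent observations
Large-scale REA dataset, built with a dedicated collection framework, for evaluating holistic spatio-temporal understanding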
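
To make the projector idea concrete, here is a toy sketch of how two separate projectors could map spatial (environment) features and temporal (video clip) features into one shared token space before they reach a language model. It is an illustration under stated assumptions, not the authors' implementation: the linear projection, the feature sizes, and every name (`make_linear`, `project`, `build_visual_tokens`) are hypothetical and do not come from the paper.

```python
import random


def make_linear(in_dim, out_dim, seed):
    """Return a random (out_dim x in_dim) weight matrix as nested lists."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
            for _ in range(out_dim)]


def project(features, weights):
    """Apply a linear map to every feature vector in the sequence."""
    return [[sum(w * x for w, x in zip(row, vec)) for row in weights]
            for vec in features]


# Hypothetical sizes, chosen only for illustration.
SPATIAL_DIM, TEMPORAL_DIM, LLM_DIM = 6, 4, 8

# One projector per pathway (an assumption: plain linear layers).
spatial_proj = make_linear(SPATIAL_DIM, LLM_DIM, seed=0)
temporal_proj = make_linear(TEMPORAL_DIM, LLM_DIM, seed=1)


def build_visual_tokens(env_features, clip_features):
    """Project both feature streams and concatenate them into one sequence."""
    spatial_tokens = project(env_features, spatial_proj)
    temporal_tokens = project(clip_features, temporal_proj)
    return spatial_tokens + temporal_tokens


# Dummy inputs: 3 environment features and 5 clip features.
env_features = [[1.0] * SPATIAL_DIM for _ in range(3)]
clip_features = [[0.5] * TEMPORAL_DIM for _ in range(5)]

tokens = build_visual_tokens(env_features, clip_features)
print(len(tokens), len(tokens[0]))  # prints "8 8"
```

In this sketch, both streams end up with the same embedding width, so a downstream model could attend over environment tokens and recent-action tokens in a single sequence. A real system would learn the projection weights rather than sample them at random.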