Spatio-Temporal LLM: Reasoning about Environments and Actions

📅 2025-07-07
🤖 AI Summary
Multimodal large language models (MLLMs) exhibit limited capability in spatio-temporal reasoning tasks requiring joint understanding of static environmental context and dynamic, recent actions, hindering perception-decision consistency for real-world intelligent agents. To address this, we propose the Spatio-Temporal Large Language Model (ST-LLM), which introduces a novel dual-path projector architecture explicitly modeling the synergistic relationship between static spatial representations and temporally evolving action sequences. ST-LLM is trained on REA, a newly curated video-text instruction dataset designed for reasoning about environments and actions, using a dual-stream encoder with explicit spatial and temporal pathways, coupled with cross-modal alignment. Experiments demonstrate that ST-LLM significantly outperforms state-of-the-art MLLMs on the REA benchmark, validating its enhanced capacity for holistic spatio-temporal reasoning. Both code and data are publicly released.

📝 Abstract
Despite the significant recent progress of Multimodal Large Language Models (MLLMs), MLLMs still struggle to correctly answer prompts that require a holistic spatio-temporal understanding. Specifically, it is challenging to address prompts that refer to 1) the entirety of an environment that an agent equipped with an MLLM can operate in; and simultaneously also refer to 2) recent actions that just happened and are encoded in a video clip. However, such a holistic spatio-temporal understanding is important for agents operating in the real world. To address this issue, we first develop a framework to collect a large-scale dataset. Using the collected "Reasoning about Environments and Actions" (REA) dataset, we show that recent methods indeed struggle to correctly answer the prompts. To improve, we develop a "spatio-temporal LLM" (ST-LLM), a model equipped with projectors to improve both spatial understanding of an environment and temporal understanding of recent observations. On the collected REA data, we show that the proposed method significantly improves results compared to prior work. Code and data are available at https://zoezheng126.github.io/STLLM-website/.
Problem

Research questions and friction points this paper is trying to address.

MLLMs lack holistic spatio-temporal understanding of environments
Challenges in processing prompts with combined spatial and temporal references
Need for improved spatial and temporal reasoning in real-world agent operations
Innovation

Methods, ideas, or system contributions that make the work stand out.

ST-LLM model enhances spatio-temporal reasoning
Projectors improve spatial and temporal understanding
Large-scale REA dataset for holistic evaluation
👥 Authors
Haozhen Zheng — University of Illinois Urbana-Champaign
Beitong Tian — University of Illinois Urbana-Champaign (Sensor Networks, Embedded Systems)
Mingyuan Wu — University of Illinois Urbana-Champaign
Zhenggang Tang — University of Illinois Urbana-Champaign
Klara Nahrstedt — Computer Science, University of Illinois Urbana-Champaign (Quality of Service, multimedia systems, distributed systems, networks, teleimmersion)
Alex Schwing — University of Illinois Urbana-Champaign