🤖 AI Summary
Multimodal large language models (MLLMs) struggle with prompts that require holistic spatio-temporal reasoning: jointly understanding the entire environment an agent operates in and the recent actions captured in a video clip. This limitation hinders agents that must keep perception and decision-making consistent in the real world. To address it, the authors first build a framework to collect "Reasoning about Environments and Actions" (REA), a large-scale video-text dataset on which recent MLLMs are shown to perform poorly. They then propose the spatio-temporal LLM (ST-LLM), which adds dedicated projectors to improve both spatial understanding of an environment and temporal understanding of recent observations. On REA, ST-LLM significantly outperforms prior state-of-the-art methods, and both code and data are publicly released.
📝 Abstract
Despite the significant recent progress of Multimodal Large Language Models (MLLMs), they still struggle to correctly answer prompts that require a holistic spatio-temporal understanding. Specifically, it is challenging to address prompts that refer to 1) the entirety of an environment that an agent equipped with an MLLM can operate in; and simultaneously also refer to 2) recent actions that just happened and are encoded in a video clip. However, such a holistic spatio-temporal understanding is important for agents operating in the real world. To address this issue, we first develop a framework to collect a large-scale dataset. Using the collected "Reasoning about Environments and Actions" (REA) dataset, we show that recent methods indeed struggle to correctly answer such prompts. To improve spatio-temporal reasoning, we develop the "spatio-temporal LLM" (ST-LLM), a model equipped with projectors that enhance both spatial understanding of an environment and temporal understanding of recent observations. On the collected REA data, the proposed method significantly improves results compared to prior work. Code and data are available at https://zoezheng126.github.io/STLLM-website/.
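The abstract describes ST-LLM as a model equipped with projectors that map spatial (environment) and temporal (recent-video) features into the LLM's input space. As a rough illustration of that idea only, here is a minimal sketch of a two-path projector; all names, dimensions, and the plain linear-projection choice are assumptions for illustration, not the paper's actual implementation:

```python
import numpy as np


class DualPathProjector:
    """Hypothetical sketch: project spatial and temporal feature streams
    into a shared LLM token-embedding space, then concatenate them.
    All shapes, names, and initialization choices are illustrative assumptions."""

    def __init__(self, d_vis: int = 512, d_llm: int = 768, seed: int = 0):
        rng = np.random.default_rng(seed)
        # one independent linear projection per path
        self.w_spatial = rng.normal(0.0, 0.02, (d_vis, d_llm))
        self.b_spatial = np.zeros(d_llm)
        self.w_temporal = rng.normal(0.0, 0.02, (d_vis, d_llm))
        self.b_temporal = np.zeros(d_llm)

    def __call__(self, spatial_feats: np.ndarray, temporal_feats: np.ndarray) -> np.ndarray:
        # spatial_feats:  (n_env_tokens, d_vis) -- e.g. whole-environment tokens
        # temporal_feats: (n_frames, d_vis)     -- e.g. per-frame tokens of the recent clip
        s = spatial_feats @ self.w_spatial + self.b_spatial
        t = temporal_feats @ self.w_temporal + self.b_temporal
        # concatenate both token streams before feeding the LLM
        return np.concatenate([s, t], axis=0)


proj = DualPathProjector()
tokens = proj(np.zeros((16, 512)), np.zeros((8, 512)))
print(tokens.shape)  # (24, 768)
```

The point of the two separate projections is that environment-level and clip-level features live in different representation spaces, so each path gets its own mapping into the LLM's embedding space before the token streams are merged.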