3DLLM-Mem: Long-Term Spatial-Temporal Memory for Embodied 3D Large Language Model

📅 2025-05-28
🤖 AI Summary
To address the limited long-horizon planning and embodied-action capabilities of large language models (LLMs) in dynamic, multi-room 3D environments, this paper introduces a joint spatial-temporal memory modeling paradigm for embodied AI. The authors construct 3DMem-Bench, a large-scale, long-sequence embodied memory benchmark comprising over 26,000 realistic agent trajectories, and propose 3DLLM-Mem, a novel architecture featuring query-driven memory selection, cross-spatial-temporal attention, and dual-stream fusion of working and episodic memory, enabling efficient encoding of 3D embodied observations and grounded action reasoning. Experiments demonstrate that the method achieves state-of-the-art performance on 3DMem-Bench, improving success rate by 16.5% over the strongest baseline on its most challenging in-the-wild embodied tasks, and significantly enhances robustness and generalization in long-horizon, cross-room navigation and manipulation.

📝 Abstract
Humans excel at performing complex tasks by leveraging long-term memory across temporal and spatial experiences. In contrast, current Large Language Models (LLMs) struggle to effectively plan and act in dynamic, multi-room 3D environments. We posit that part of this limitation is due to the lack of proper 3D spatial-temporal memory modeling in LLMs. To address this, we first introduce 3DMem-Bench, a comprehensive benchmark comprising over 26,000 trajectories and 2,892 embodied, question-answering, and captioning tasks, designed to evaluate an agent's ability to reason over long-term memory in 3D environments. Second, we propose 3DLLM-Mem, a novel dynamic memory management and fusion model for embodied spatial-temporal reasoning and actions in LLMs. Our model uses working memory tokens, which represent current observations, as queries to selectively attend to and fuse the most useful spatial and temporal features from episodic memory, which stores past observations and interactions. Our approach allows the agent to focus on task-relevant information while maintaining memory efficiency in complex, long-horizon environments. Experimental results demonstrate that 3DLLM-Mem achieves state-of-the-art performance across various tasks, outperforming the strongest baselines by 16.5% in success rate on 3DMem-Bench's most challenging in-the-wild embodied tasks.
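The abstract's core mechanism — working memory tokens acting as queries that selectively attend to and fuse episodic memory features — can be illustrated with a minimal NumPy sketch. This is an assumption-laden illustration, not the paper's implementation: single-head dot-product attention is assumed, and the `top_k` selection heuristic and the function name `fuse_memory` are hypothetical stand-ins for the paper's query-driven memory selection.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked (-inf) entries get zero weight.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_memory(working, episodic, top_k=4):
    """Hypothetical sketch of query-driven memory fusion.

    working:  (n_work, d)  current-observation (working memory) tokens
    episodic: (n_epis, d)  stored past-observation (episodic memory) features
    Each working token attends only to its top_k most relevant episodic
    features, then the attended features are fused back residually.
    """
    d = working.shape[-1]
    scores = working @ episodic.T / np.sqrt(d)          # (n_work, n_epis)
    if top_k < scores.shape[1]:
        # Keep the top_k scores per query; mask the rest before softmax.
        thresh = np.sort(scores, axis=1)[:, -top_k][:, None]
        scores = np.where(scores >= thresh, scores, -np.inf)
    attn = softmax(scores, axis=1)                      # selective attention
    return working + attn @ episodic                    # residual fusion
```

The sparsity from top-k selection is one plausible way to realize the paper's stated goal of "maintaining memory efficiency" as episodic memory grows over long horizons: attention cost still scales with memory size, but only a bounded number of episodic features influence each working token.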
Problem

Research questions and friction points this paper is trying to address.

Enhancing LLMs' long-term spatial-temporal memory in 3D environments
Addressing inefficiency in dynamic multi-room 3D task planning
Improving memory fusion for embodied reasoning and actions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces 3DMem-Bench for memory evaluation
Proposes 3DLLM-Mem for dynamic memory fusion
Uses working memory tokens for selective attention
Authors

Wenbo Hu (University of California, Los Angeles)
Yining Hong (Stanford)
Yanjun Wang (University of California, Los Angeles)
Leison Gao (University of California, Los Angeles)
Zibu Wei (University of California, Los Angeles)
Xingcheng Yao (Moonshot AI)
Nanyun Peng (University of California, Los Angeles)
Yonatan Bitton (Research Scientist, Google)
Idan Szpektor (Google Research)
Kai-Wei Chang (University of California, Los Angeles)