RoboStream: Weaving Spatio-Temporal Reasoning with Memory in Vision-Language Models for Robotics

📅 2026-03-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language models lack spatiotemporal reasoning and memory mechanisms for long-horizon robotic manipulation, often failing to track task states due to occlusions or accumulated action errors. This work proposes a training-free framework that, for the first time, integrates human-like causal spatiotemporal reasoning and persistent memory into vision-language robotic planning. The approach employs Spatiotemporal Fusion Tokens (STF-Tokens) to anchor 3D geometric information and constructs a Causal Spatiotemporal Graph (CSTG) to model cross-step state transitions, enabling continuous object localization and causal chain tracing. Evaluated on long-horizon RLBench tasks, the method achieves a 90.5% success rate, and in real-world block-stacking tasks, it attains 44.4%—substantially outperforming SoFar and VoxPoser, both of which achieve only 11.1%.
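The paper does not include code here, but the cross-step state tracking the summary describes can be illustrated as a directed graph whose edges record which action caused which object-state transition, so causal chains can be traced backwards. This is a minimal sketch under my own assumptions; all names (`CausalSTGraph`, `record_transition`, `causal_chain`) are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass

# Hypothetical sketch of a causal spatio-temporal graph (CSTG):
# nodes are (object, step) states; each edge records the action that
# produced a new state from the previous one.

@dataclass
class StateNode:
    obj: str    # object identifier, e.g. "red_block"
    step: int   # planning step at which this state holds
    attrs: dict # e.g. {"on_top_of": "table"}

class CausalSTGraph:
    def __init__(self):
        self.nodes = {}    # (obj, step) -> StateNode
        self.parents = {}  # (obj, step) -> (action, previous (obj, step))

    def add_state(self, node):
        self.nodes[(node.obj, node.step)] = node

    def record_transition(self, obj, prev_step, action, new_step, new_attrs):
        """Record that `action` applied at `prev_step` produced the new state."""
        self.add_state(StateNode(obj, new_step, new_attrs))
        self.parents[(obj, new_step)] = (action, (obj, prev_step))

    def causal_chain(self, obj, step):
        """Trace the sequence of actions that led to the state at `step`."""
        chain, key = [], (obj, step)
        while key in self.parents:
            action, prev = self.parents[key]
            chain.append(action)
            key = prev
        return list(reversed(chain))

g = CausalSTGraph()
g.add_state(StateNode("red_block", 0, {"on_top_of": "table"}))
g.record_transition("red_block", 0, "pick", 1, {"held_by": "gripper"})
g.record_transition("red_block", 1, "place_on(blue_block)", 2,
                    {"on_top_of": "blue_block"})
print(g.causal_chain("red_block", 2))  # ['pick', 'place_on(blue_block)']
```

Tracing the chain lets a planner check preconditions against what its own actions have already changed, rather than re-inferring everything from the current frame.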

📝 Abstract
Enabling reliable long-horizon robotic manipulation is a crucial step toward open-world embodied intelligence. However, VLM-based planners treat each step as an isolated observation-to-action mapping, forcing them to reinfer scene geometry from raw pixels at every decision point while remaining unaware of how prior actions have reshaped the environment. Despite strong short-horizon performance, these systems lack the spatio-temporal reasoning required for persistent geometric anchoring and memory of action-triggered state transitions. Without persistent state tracking, perceptual errors accumulate across the execution horizon, temporarily occluded objects are catastrophically forgotten, and these compounding failures lead to precondition violations that cascade through subsequent steps. In contrast, humans maintain a persistent mental model that continuously tracks spatial relations and action consequences across interactions rather than reconstructing them at each instant. Inspired by this human capacity for causal spatio-temporal reasoning with persistent memory, we propose RoboStream, a training-free framework that achieves geometric anchoring through Spatio-Temporal Fusion Tokens (STF-Tokens), which bind visual evidence to 3D geometric attributes for persistent object grounding, and maintains causal continuity via a Causal Spatio-Temporal Graph (CSTG) that records action-triggered state transitions across steps. This design enables the planner to trace causal chains and preserve object permanence under occlusion without additional training or fine-tuning. RoboStream achieves 90.5% on long-horizon RLBench and 44.4% on challenging real-world block-building tasks, where both SoFar and VoxPoser score 11.1%, demonstrating that spatio-temporal reasoning and causal memory are critical missing components for reliable long-horizon manipulation.
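The persistent geometric anchoring the abstract attributes to STF-Tokens can be sketched as a small object memory that binds each label to its last observed 3D position and retains that anchor when the object drops out of view. This is an illustrative sketch only, not the paper's implementation; `ObjectMemory`, `update`, and `locate` are names I introduce for the example.

```python
# Illustrative sketch (not the paper's code): a minimal fusion-token-style
# memory that binds object labels to 3D positions and preserves them under
# occlusion, instead of re-deriving geometry from pixels every step.

class ObjectMemory:
    def __init__(self):
        # label -> {"pos": (x, y, z), "last_seen": step, "visible": bool}
        self.tokens = {}

    def update(self, step, detections):
        """detections: {label: (x, y, z)} for objects visible this step."""
        for label, pos in detections.items():
            self.tokens[label] = {"pos": pos, "last_seen": step, "visible": True}
        # Objects absent from this frame are marked occluded but retained.
        for label, tok in self.tokens.items():
            if label not in detections:
                tok["visible"] = False

    def locate(self, label):
        """Return the last anchored 3D position, even under occlusion."""
        tok = self.tokens.get(label)
        return None if tok is None else tok["pos"]

mem = ObjectMemory()
mem.update(0, {"red_block": (0.1, 0.2, 0.0), "blue_block": (0.3, 0.2, 0.0)})
mem.update(1, {"blue_block": (0.3, 0.2, 0.0)})  # red_block now occluded
print(mem.locate("red_block"))  # (0.1, 0.2, 0.0): still grounded
```

The point of the sketch is object permanence: a temporarily occluded block keeps its last anchored pose rather than being forgotten, which is the failure mode the abstract says per-step pixel re-inference cannot avoid.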
Problem

Research questions and friction points this paper is trying to address.

spatio-temporal reasoning
memory
long-horizon manipulation
object permanence
state tracking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatio-Temporal Reasoning
Persistent Memory
Geometric Anchoring
Causal State Tracking
Training-Free Framework
👥 Authors
Yuzhi Huang
Shenzhen International Graduate School, Tsinghua University
Jie Wu
SIGS, Tsinghua University
Code Generation
Weijue Bu
China University of Mining and Technology
Ziyi Xiong
Shenzhen International Graduate School, Tsinghua University
Gaoyang Jiang
Huazhong University of Science and Technology
Ye Li
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
MRI
Kangye Ji
Shenzhen International Graduate School, Tsinghua University
Shuzhao Xie
Tsinghua University
Graphics, Multimedia
Yue Huang
Professor, Xiamen University
Signal processing, image processing, machine learning
Chenglei Wu
YXGN Robotics
Jingyan Jiang
Shenzhen Technology University
Test-time adaptation, Embodied AI, Machine learning systems
Zhi Wang
Associate Professor, SIGS, Tsinghua University
Multimedia networks, edge computing, distributed machine learning