ST-$\pi$: Structured SpatioTemporal VLA for Robotic Manipulation

📅 2026-04-20
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the challenge that existing vision-language-action (VLA) models struggle to explicitly model fine-grained spatiotemporal boundaries in multi-stage tasks. The authors propose a structured spatiotemporal VLA framework that introduces, for the first time, a chunk-level spatiotemporal action prompting mechanism. This approach leverages a spatiotemporal vision-language model to generate structured action prompts and employs dual generators to separately capture spatial dependencies and temporal causality, enabling hierarchical reasoning from global planning to local control. The method integrates 4D observation encoding, large language model-guided chunked planning, and structured action prediction, and is evaluated on a newly curated real-world robotic dataset with spatiotemporal annotations. Experiments demonstrate that the model significantly outperforms current methods across multiple fine-grained manipulation tasks, effectively enhancing both complex temporal behavior understanding and precise action execution.

📝 Abstract
Vision-language-action (VLA) models have achieved great success on general robotic tasks, but still face challenges in fine-grained spatiotemporal manipulation. Existing methods typically embed spatiotemporal knowledge into visual and action representations and directly perform a cross-modal mapping for step-level action prediction. However, such spatiotemporal reasoning remains largely implicit, making it difficult to handle multiple sequential behaviors with explicit spatiotemporal boundaries. In this work, we propose ST-$\pi$, a structured spatiotemporal VLA model for robotic manipulation. Our model is guided by two key designs: 1) Spatiotemporal VLM. We encode 4D observations and task instructions into latent spaces and feed them into the LLM to generate a sequence of causally ordered chunk-level action prompts, each consisting of a sub-task, spatial grounding, and temporal grounding. 2) Spatiotemporal action expert. Conditioned on the chunk-level action prompts, we design a structured dual-generator guidance to jointly model spatial dependencies and temporal causality, thus predicting step-level action parameters. Within this structured framework, the VLM explicitly plans global spatiotemporal behavior, and the action expert further refines local spatiotemporal control. In addition, we propose a real-world robotic dataset with structured spatiotemporal annotations for fine-tuning. Extensive experiments demonstrate the effectiveness of our model. Code is available at https://github.com/chuanhaoma/ST-pi.
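
To make the idea of causally ordered chunk-level action prompts more concrete, here is a minimal Python sketch. It is not taken from the ST-$\pi$ code release. The class `ChunkPrompt`, its field layout (a 3D target point for spatial grounding, a half-open step range for temporal grounding), and the helpers `is_causally_ordered` and `active_chunk` are all hypothetical names chosen for illustration. The abstract only states that each prompt contains a sub-task, spatial grounding, and temporal grounding.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ChunkPrompt:
    """One chunk-level action prompt (hypothetical field layout)."""
    sub_task: str                         # natural-language sub-task
    spatial: Tuple[float, float, float]   # assumption: a 3D target point
    temporal: Tuple[int, int]             # assumption: [start, end) step indices


def is_causally_ordered(plan: List[ChunkPrompt]) -> bool:
    """Check that chunks are non-empty, non-overlapping, and in time order."""
    prev_end = 0
    for chunk in plan:
        start, end = chunk.temporal
        if start < prev_end or end <= start:
            return False
        prev_end = end
    return True


def active_chunk(plan: List[ChunkPrompt], step: int) -> Optional[ChunkPrompt]:
    """Return the chunk whose temporal range contains `step`, if any."""
    for chunk in plan:
        start, end = chunk.temporal
        if start <= step < end:
            return chunk
    return None


# Illustrative plan; the task, coordinates, and step counts are made up.
example_plan = [
    ChunkPrompt("reach for the cup", (0.40, 0.10, 0.05), (0, 20)),
    ChunkPrompt("grasp the cup", (0.40, 0.10, 0.02), (20, 30)),
    ChunkPrompt("place the cup on the shelf", (0.10, 0.50, 0.30), (30, 60)),
]
```

Under these assumptions, a step-level controller would call `active_chunk(example_plan, step)` at each step to find the prompt it should be conditioned on. This mirrors the global-plan-then-local-control split described in the abstract, but says nothing about how the actual action expert consumes the prompts.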
Problem

Research questions and friction points this paper is trying to address.

spatiotemporal manipulation
vision-language-action models
sequential behaviors
temporal causality
spatial grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

structured spatiotemporal reasoning
vision-language-action model
chunk-level action planning
dual-generator guidance
robotic manipulation