AI Summary
This work addresses the challenge that existing vision-language-action (VLA) models struggle to explicitly model fine-grained spatiotemporal boundaries in multi-stage tasks. The authors propose a structured spatiotemporal VLA framework that introduces, for the first time, a chunk-level spatiotemporal action prompting mechanism. This approach leverages a spatiotemporal vision-language model to generate structured action prompts and employs dual generators to separately capture spatial dependencies and temporal causality, enabling hierarchical reasoning from global planning to local control. The method integrates 4D observation encoding, large language model-guided chunked planning, and structured action prediction, and is evaluated on a newly curated real-world robotic dataset with spatiotemporal annotations. Experiments demonstrate that the model significantly outperforms current methods across multiple fine-grained manipulation tasks, effectively enhancing both complex temporal behavior understanding and precise action execution.
Abstract
Vision-language-action (VLA) models have achieved great success on general robotic tasks, but still face challenges in fine-grained spatiotemporal manipulation. Existing methods typically embed spatiotemporal knowledge into visual and action representations and directly perform a cross-modal mapping for step-level action prediction. However, such spatiotemporal reasoning remains largely implicit, making it difficult to handle multiple sequential behaviors with explicit spatiotemporal boundaries. In this work, we propose ST-$\pi$, a structured spatiotemporal VLA model for robotic manipulation. Our model is guided by two key designs: 1) Spatiotemporal VLM. We encode 4D observations and task instructions into latent spaces and feed them into the LLM to generate a sequence of causally ordered chunk-level action prompts, each consisting of a sub-task, spatial grounding, and temporal grounding. 2) Spatiotemporal action expert. Conditioned on the chunk-level action prompts, we design a structured dual-generator guidance to jointly model spatial dependencies and temporal causality, thus predicting step-level action parameters. Within this structured framework, the VLM explicitly plans global spatiotemporal behavior, and the action expert further refines local spatiotemporal control. In addition, we introduce a real-world robotic dataset with structured spatiotemporal annotations for fine-tuning. Extensive experiments demonstrate the effectiveness of our model. Our code is available at https://github.com/chuanhaoma/ST-pi.
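
To make the idea of causally ordered chunk-level action prompts more concrete, here is a minimal Python sketch of one possible data layout. It is not taken from the ST-$\pi$ code. The class `ChunkPrompt`, its field names, the choice of a 3D point for spatial grounding, and the choice of a half-open step interval for temporal grounding are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ChunkPrompt:
    """One chunk-level action prompt (hypothetical layout, not the paper's format)."""
    sub_task: str                                   # natural-language sub-task
    spatial_grounding: Tuple[float, float, float]   # assumed: a 3D target point
    temporal_grounding: Tuple[int, int]             # assumed: [start, end) step indices


def is_causally_ordered(prompts: List[ChunkPrompt]) -> bool:
    """True if every chunk is non-empty and starts no earlier than the previous one ends."""
    previous_end = 0
    for prompt in prompts:
        start, end = prompt.temporal_grounding
        if start < previous_end or end <= start:
            return False
        previous_end = end
    return True


def chunk_for_step(prompts: List[ChunkPrompt], step: int) -> Optional[ChunkPrompt]:
    """Return the chunk whose temporal window contains `step`, or None if no chunk does."""
    for prompt in prompts:
        start, end = prompt.temporal_grounding
        if start <= step < end:
            return prompt
    return None


# Illustrative plan with made-up values: two consecutive sub-tasks.
plan = [
    ChunkPrompt("reach the cup", (0.40, 0.10, 0.20), (0, 20)),
    ChunkPrompt("grasp the cup", (0.40, 0.10, 0.05), (20, 35)),
]
```

In this sketch, a step-level controller would look up the active chunk with `chunk_for_step(plan, step)` and condition its action prediction on that chunk. This mirrors the global-planning and local-control split described in the abstract only at a schematic level.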