ST4VLA: Spatially Guided Training for Vision-Language-Action Models

📅 2026-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the difficulty large vision-language models have in reliably mapping high-level instructions to low-level actions in embodied tasks, a gap often caused by inconsistent optimization of spatial perception and action generation. To overcome this limitation, the authors propose ST4VLA, a two-stage framework: it first pretrains a spatial localization module on point, bounding-box, and trajectory prediction to learn visual-spatial priors, and then, during action-oriented fine-tuning, introduces a spatial prompting mechanism that guides the policy network toward actions aligned with the predicted spatial targets. This design jointly optimizes spatial understanding and action decision-making, improves SimplerEnv success rates from 66.1 to 84.6 on Google Robot and from 54.7 to 73.2 on WidowX, and generalizes well to unseen objects, rephrased instructions, and long-horizon perturbations.
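
As a rough illustration of the grounding objectives in stage (i), the sketch below attaches point, box, and trajectory prediction heads to pooled VLM features and supervises them with plain regression losses. The module names, feature dimensions, and loss choices are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of stage (i), spatial grounding pre-training: the VLM backbone is
# supervised to predict points, boxes, and short trajectories. All names, shapes,
# and the simple regression heads are assumptions for illustration.
import torch
import torch.nn as nn


class GroundingHeads(nn.Module):
    def __init__(self, dim: int, traj_len: int = 8):
        super().__init__()
        self.point = nn.Linear(dim, 2)               # (x, y), normalized to [0, 1]
        self.box = nn.Linear(dim, 4)                 # (x1, y1, x2, y2)
        self.traj = nn.Linear(dim, traj_len * 2)     # flattened 2D waypoints

    def forward(self, feats: torch.Tensor):
        return (self.point(feats).sigmoid(),
                self.box(feats).sigmoid(),
                self.traj(feats).sigmoid())


if __name__ == "__main__":
    dim, B, T = 256, 4, 8
    feats = torch.randn(B, dim)                      # stand-in for pooled VLM features
    heads = GroundingHeads(dim, T)
    pt, box, traj = heads(feats)

    # Labels would come from web-scale grounding data and robot demonstrations.
    loss = (nn.functional.l1_loss(pt, torch.rand(B, 2))
            + nn.functional.l1_loss(box, torch.rand(B, 4))
            + nn.functional.l1_loss(traj, torch.rand(B, T * 2)))
    loss.backward()
    print(pt.shape, box.shape, traj.shape, float(loss))
```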

📝 Abstract
Large vision-language models (VLMs) excel at multimodal understanding but fall short when extended to embodied tasks, where instructions must be transformed into low-level motor actions. We introduce ST4VLA, a dual-system Vision-Language-Action framework that leverages Spatially Guided Training to align action learning with spatial priors in VLMs. ST4VLA includes two stages: (i) spatial grounding pre-training, which equips the VLM with transferable priors via scalable point, box, and trajectory prediction from both web-scale and robot-specific data, and (ii) spatially guided action post-training, which encourages the model to produce richer spatial priors that guide action generation via spatial prompting. This design preserves spatial grounding during policy learning and promotes consistent optimization across spatial and action objectives. Empirically, ST4VLA achieves substantial improvements over a vanilla VLA, with performance increasing from 66.1 to 84.6 on Google Robot and from 54.7 to 73.2 on WidowX Robot, establishing new state-of-the-art results on SimplerEnv. It also demonstrates stronger generalization to unseen objects and paraphrased instructions, as well as robustness to long-horizon perturbations in real-world settings. These results highlight scalable spatially guided training as a promising direction for robust, generalizable robot learning. Source code, data, and models are released at https://internrobotics.github.io/internvla-m1.github.io/
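
The minimal sketch below illustrates the stage (ii) spatial-prompting idea under stated assumptions: a predicted spatial target is embedded as a prompt, concatenated with the VLM features before the action head, and the grounding and action losses are optimized jointly. The module names, dimensions, and the 7-DoF action output are assumptions for illustration, not the released implementation.

```python
# Hedged sketch of stage (ii), spatially guided action post-training.
import torch
import torch.nn as nn


class SpatiallyPromptedPolicy(nn.Module):
    def __init__(self, dim: int, action_dim: int = 7):
        super().__init__()
        self.point_head = nn.Linear(dim, 2)          # grounding head kept from stage (i) (assumed)
        self.prompt_embed = nn.Linear(2, dim)        # embed the predicted point as a spatial prompt
        self.action_head = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, action_dim)
        )

    def forward(self, feats: torch.Tensor):
        point = self.point_head(feats).sigmoid()
        prompt = self.prompt_embed(point)
        action = self.action_head(torch.cat([feats, prompt], dim=-1))
        return point, action


if __name__ == "__main__":
    dim, B = 256, 4
    feats = torch.randn(B, dim)                      # stand-in for VLM features
    policy = SpatiallyPromptedPolicy(dim)
    point, action = policy(feats)

    # Joint objective: the grounding term keeps spatial priors from degrading
    # while the action term fits robot demonstrations.
    loss = (nn.functional.mse_loss(action, torch.randn(B, 7))
            + nn.functional.l1_loss(point, torch.rand(B, 2)))
    loss.backward()
    print(point.shape, action.shape, float(loss))
```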
Problem

Research questions and friction points this paper is trying to address.

vision-language-action
embodied AI
spatial grounding
robot learning
action generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatially Guided Training
Vision-Language-Action Models
Spatial Grounding
Robot Learning
Multimodal Alignment
Jinhui Ye
Shanghai AI Laboratory
Fangjing Wang
Southern University of Science and Technology
Ning Gao
Shanghai AI Laboratory
Junqiu Yu
Shanghai AI Laboratory
Yangkun Zhu
Shanghai AI Laboratory
Bin Wang
Pengcheng Laboratory
Cloud Computing · IIoT · Green Computing · Computer Architecture
Jinyu Zhang
Shanghai AI Laboratory
Weiyang Jin
Shanghai AI Laboratory
Yanwei Fu
Fudan University
Computer Vision · Machine Learning · Multimedia
Feng Zheng
Southern University of Science and Technology
Embodied Intelligence · Spatialtemporal AI · Computer Vision
Yilun Chen
Shanghai AI Laboratory
Autonomous Driving · Embodied AI · Computer Vision
Jiangmiao Pang
Shanghai AI Laboratory