InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy

📅 2025-10-15
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This work addresses the challenge of achieving generalization and scalability in instruction-following robots. The authors propose a unified, spatially grounded vision-language-action framework that leverages spatial grounding as the central link between instruction understanding and action generation, jointly modeling "where to act" (spatial localization) and "how to act" (action policy) while supporting plug-and-play control across heterogeneous robot morphologies. The method uses a two-stage training paradigm: (1) spatial reasoning pretraining on 2.3 million samples, and (2) embodied action post-training with spatial prompting. To support learning, the authors build a simulation engine that generates 244K pick-and-place task instances. Evaluated across multiple benchmarks, the approach achieves an average performance gain of 6.2%, improves zero-shot generalization to unseen objects and novel scenes by 20.6%, and outperforms prior methods by over 10% on long-horizon tasks.

📝 Abstract
We introduce InternVLA-M1, a unified framework for spatial grounding and robot control that advances instruction-following robots toward scalable, general-purpose intelligence. Its core idea is spatially guided vision-language-action training, where spatial grounding serves as the critical link between instructions and robot actions. InternVLA-M1 employs a two-stage pipeline: (i) spatial grounding pre-training on over 2.3M spatial reasoning examples to determine "where to act" by aligning instructions with visual, embodiment-agnostic positions, and (ii) spatially guided action post-training to decide "how to act" by generating embodiment-aware actions through plug-and-play spatial prompting. This spatially guided training recipe yields consistent gains: InternVLA-M1 outperforms its variant without spatial guidance by +14.6% on SimplerEnv Google Robot, +17% on WidowX, and +4.3% on LIBERO Franka, while demonstrating stronger spatial reasoning capability in box, point, and trace prediction. To further scale instruction following, we built a simulation engine to collect 244K generalizable pick-and-place episodes, enabling a 6.2% average improvement across 200 tasks and 3K+ objects. In real-world cluttered pick-and-place, InternVLA-M1 improved by 7.3%, and with synthetic co-training, achieved +20.6% on unseen objects and novel configurations. Moreover, in long-horizon reasoning-intensive scenarios, it surpassed existing works by over 10%. These results highlight spatially guided training as a unifying principle for scalable and resilient generalist robots. Code and models are available at https://github.com/InternRobotics/InternVLA-M1.
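To make the two-stage recipe concrete, here is a minimal, hypothetical Python sketch of the inference-time flow it implies: a spatially grounded VLM first predicts "where to act" as an embodiment-agnostic image location, and an action expert then conditions on that spatial prompt to decide "how to act". All class and function names below are illustrative assumptions, not InternVLA-M1's released API; see the repository linked above for the actual implementation.

```python
# Hypothetical sketch of the spatially guided two-stage flow described in
# the abstract; names and signatures are illustrative, not the paper's API.
from dataclasses import dataclass


@dataclass
class SpatialPrompt:
    """Embodiment-agnostic 'where to act': a 2D point in image coordinates."""
    x: float  # normalized column in [0, 1]
    y: float  # normalized row in [0, 1]


class SpatialGrounder:
    """Stage (i) product: a VLM pre-trained on spatial reasoning data
    (box / point / trace prediction) to ground instructions in the image."""

    def ground(self, image, instruction: str) -> SpatialPrompt:
        # Stand-in for the learned model: always points at the image center.
        return SpatialPrompt(x=0.5, y=0.5)


class ActionExpert:
    """Stage (ii) product: an embodiment-aware policy post-trained to map
    (observation, spatial prompt) -> low-level actions for one robot."""

    def act(self, image, prompt: SpatialPrompt) -> list[float]:
        # Stand-in action chunk; a real policy would emit joint/EE commands.
        return [prompt.x, prompt.y, 0.0]


def step(grounder: SpatialGrounder, expert: ActionExpert, image, instruction: str):
    where = grounder.ground(image, instruction)  # "where to act"
    return expert.act(image, where)              # "how to act"


if __name__ == "__main__":
    print(step(SpatialGrounder(), ActionExpert(),
               image=None, instruction="pick up the red mug"))
```

Because the spatial prompt lives in image space rather than in any robot's action space, the same grounder can in principle drive different action experts; that is the "plug-and-play" property the abstract highlights.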
Problem

Research questions and friction points this paper is trying to address.

Developing a spatially guided vision-language-action framework for robot control
Creating a generalist robot policy for scalable instruction-following intelligence
Improving spatial reasoning and action generation in robotic systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatially guided vision-language-action training framework
Two-stage spatial grounding and action post-training pipeline
Plug-and-play spatial prompting for embodiment-aware actions (see the sketch below)
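As referenced in the last item, here is a hedged illustration of the plug-and-play idea: one embodiment-agnostic spatial prompt drives interchangeable, embodiment-specific action heads. The robot names, DoF counts, and helper functions are assumptions for illustration only, not the released interface.

```python
# Hypothetical illustration of plug-and-play spatial prompting: the same
# image-space target drives interchangeable, embodiment-specific action
# heads. All names here are illustrative assumptions.
from typing import Callable

Point = tuple[float, float]  # normalized (x, y) in image coordinates


def franka_head(obs, point: Point) -> dict:
    # Stand-in policy for a 7-DoF Franka arm conditioned on the prompt.
    return {"embodiment": "franka", "dof": 7, "target": point}


def widowx_head(obs, point: Point) -> dict:
    # Stand-in policy for a 6-DoF WidowX arm; same prompt, new action space.
    return {"embodiment": "widowx", "dof": 6, "target": point}


# Swapping robots only swaps the head; the spatial prompt is unchanged.
ACTION_HEADS: dict[str, Callable] = {"franka": franka_head, "widowx": widowx_head}


def control(robot: str, obs, point: Point) -> dict:
    return ACTION_HEADS[robot](obs, point)


if __name__ == "__main__":
    prompt = (0.42, 0.61)  # e.g. grounded from "pick up the red mug"
    print(control("franka", obs=None, point=prompt))
    print(control("widowx", obs=None, point=prompt))
```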
👥 Authors
Xinyi Chen (Intern Robotics, Shanghai AI Laboratory)
Yilun Chen (Intern Robotics, Shanghai AI Laboratory)
Yanwei Fu (Fudan University): Computer Vision, Machine Learning, Multimedia
Ning Gao (Intern Robotics, Shanghai AI Laboratory)
Jiaya Jia (Chair Professor, HKUST; Adjunct Prof., CUHK): Artificial Intelligence, Computer Vision, Deep Learning
Weiyang Jin (Intern Robotics, Shanghai AI Laboratory)
Hao Li (Intern Robotics, Shanghai AI Laboratory)
Yao Mu (Intern Robotics, Shanghai AI Laboratory)
Jiangmiao Pang (Intern Robotics, Shanghai AI Laboratory)
Yu Qiao (Intern Robotics, Shanghai AI Laboratory)
Yang Tian (Intern Robotics, Shanghai AI Laboratory)
Bin Wang (Intern Robotics, Shanghai AI Laboratory)
Bolun Wang (Intern Robotics, Shanghai AI Laboratory)
Fangjing Wang (Intern Robotics, Shanghai AI Laboratory)
Hanqing Wang (Intern Robotics, Shanghai AI Laboratory)
Tai Wang (Shanghai AI Laboratory): Computer Vision, 3D Vision, Embodied AI, Deep Learning
Ziqin Wang (Beihang University): Embodied AI, Robotics, Large Language Models, Computer Vision
Xueyuan Wei (Intern Robotics, Shanghai AI Laboratory)
Chao Wu (Intern Robotics, Shanghai AI Laboratory)
Shuai Yang (Intern Robotics, Shanghai AI Laboratory)
Jinhui Ye (Intern Robotics, Shanghai AI Laboratory)
Junqiu Yu (Intern Robotics, Shanghai AI Laboratory)
Jia Zeng (Intern Robotics, Shanghai AI Laboratory)
Jingjing Zhang (Intern Robotics, Shanghai AI Laboratory)
Jinyu Zhang (Intern Robotics, Shanghai AI Laboratory)
Shi Zhang (Intern Robotics, Shanghai AI Laboratory)
Feng Zheng (Intern Robotics, Shanghai AI Laboratory)
Bowen Zhou (Intern Robotics, Shanghai AI Laboratory)
Yangkun Zhu (Intern Robotics, Shanghai AI Laboratory)