$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

📅 2025-04-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited open-world generalization of Vision-Language-Action (VLA) models in real domestic environments. To this end, it proposes co-training on heterogeneous tasks: data from multiple robot platforms, web-sourced semantic data, and hierarchical action representations are combined into hybrid multimodal training examples that unify object detection, semantic subtask prediction, and low-level action generation, alongside a cross-platform VLA architecture that supports zero-shot transfer. The approach demonstrates, for the first time, that an end-to-end learned system can perform long-horizon, dexterous manipulation tasks (such as cleaning a kitchen or bedroom) in entirely unseen homes, yielding substantial improvements in open-world generalization and moving end-to-end embodied intelligence closer to practical deployment.

📝 Abstract
In order for robots to be useful, they must perform practically relevant tasks in the real world, outside of the lab. While vision-language-action (VLA) models have demonstrated impressive results for end-to-end robot control, it remains an open question how far such models can generalize in the wild. We describe $\pi_{0.5}$, a new model based on $\pi_0$ that uses co-training on heterogeneous tasks to enable broad generalization. $\pi_{0.5}$ uses data from multiple robots, high-level semantic prediction, web data, and other sources to enable broadly generalizable real-world robotic manipulation. Our system uses a combination of co-training and hybrid multi-modal examples that combine image observations, language commands, object detections, semantic subtask prediction, and low-level actions. Our experiments show that this kind of knowledge transfer is essential for effective generalization, and we demonstrate for the first time that an end-to-end learning-enabled robotic system can perform long-horizon and dexterous manipulation skills, such as cleaning a kitchen or bedroom, in entirely new homes.
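The co-training recipe the abstract describes mixes heterogeneous example types (web vision-language data, object detection, semantic subtask prediction, low-level robot actions) into each training batch. A minimal sketch of such a weighted data mixture is below; the source names and mixture weights are illustrative assumptions, not the paper's actual configuration.

```python
import random

# Hypothetical co-training mixture: each batch draws example types
# according to fixed weights. Names and weights are placeholders.
SOURCES = {
    "web_vision_language": 0.4,
    "object_detection": 0.1,
    "subtask_prediction": 0.2,
    "robot_actions": 0.3,
}

def sample_batch(batch_size: int, rng: random.Random) -> list[str]:
    """Sample which data source each example in a batch comes from."""
    names = list(SOURCES)
    weights = [SOURCES[n] for n in names]
    return rng.choices(names, weights=weights, k=batch_size)

batch = sample_batch(8, random.Random(0))
```

In practice such a sampler would index into per-source datasets; the point here is only that all task types share one model and one batch stream.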
Problem

Research questions and friction points this paper is trying to address.

Enabling robots to perform real-world tasks outside labs
Improving vision-language-action models' open-world generalization
Achieving long-horizon manipulation in unseen environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Co-training on heterogeneous tasks for generalization
Multi-modal data integration for robotic manipulation
End-to-end learning for long-horizon manipulation skills
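The hierarchy implied above (high-level semantic subtask prediction feeding low-level action generation) can be sketched as a two-stage control loop. All names and shapes here are illustrative assumptions, not the paper's API; the action chunk is a placeholder of zero joint deltas.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    image: bytes   # camera frame (placeholder)
    command: str   # high-level language command

def predict_subtask(obs: Observation) -> str:
    # Stand-in for the high-level semantic prediction head,
    # e.g. "clean the kitchen" -> "pick up the sponge".
    return f"subtask for: {obs.command}"

def predict_actions(obs: Observation, subtask: str, horizon: int = 4) -> list[list[float]]:
    # Stand-in for the low-level action expert: returns a chunk of
    # `horizon` actions, each a 7-DoF vector (placeholder zeros).
    return [[0.0] * 7 for _ in range(horizon)]

def control_step(obs: Observation) -> list[list[float]]:
    subtask = predict_subtask(obs)        # stage 1: language subtask
    return predict_actions(obs, subtask)  # stage 2: action chunk

actions = control_step(Observation(image=b"", command="clean the kitchen"))
```

A single model can serve both stages; the split is in the output representation, not in separate networks.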
👥 Authors
Kevin Black
Noah Brown
James Darpinian
Karan Dhabalia
Danny Driess (Google DeepMind; Machine Learning, Robotics)
Adnan Esmail
Michael Equi (UC Berkeley; Machine Learning, Robot Learning)
Chelsea Finn (Stanford University, Physical Intelligence; Machine Learning, Robotics, Reinforcement Learning)
Niccolo Fusai
Manuel Y. Galliker
Dibya Ghosh (UC Berkeley; Machine Learning, Reinforcement Learning, Optimization)
Lachy Groom
Karol Hausman (Physical Intelligence, Stanford; Machine Learning, Robotics, Reinforcement Learning)
Brian Ichter (Physical Intelligence; Robotics, Machine Learning, Foundation Models)
Szymon Jakubczak
Tim Jones
Liyiming Ke (Physical Intelligence)
Devin LeBlanc
Sergey Levine (UC Berkeley, Physical Intelligence; Machine Learning, Robotics, Reinforcement Learning)
Adrian Li-Bell
Mohith Mothukuri
Suraj Nair
Karl Pertsch (UC Berkeley, Stanford University; Artificial Intelligence, Machine Learning, Robotics)
Allen Z. Ren (Physical Intelligence; Robotics, Machine Learning)
Lucy Xiaoyang Shi (Stanford University, Physical Intelligence; Machine Learning, Robotics, Reinforcement Learning)
Laura Smith
Jost Tobias Springenberg (Google DeepMind; Machine Learning)
Kyle Stachowicz (UC Berkeley; Reinforcement Learning, Learning-based Control, Robotics)
James Tanner
Quan Vuong (Physical Intelligence; Reinforcement Learning, Computer Vision)
H. Walke
Anna Walling
Haohuan Wang
Lili Yu (Meta AI; Natural Language Processing, Machine Learning)
Ury Zhilinsky