$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

📅 2025-04-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited open-world generalization of Vision-Language-Action (VLA) models in real domestic environments. To this end, it proposes co-training on heterogeneous tasks: data from multiple robot platforms, web-sourced semantic data, and hierarchical action representations are combined into hybrid multimodal training examples that unify object detection, semantic subtask prediction, and low-level action generation, alongside a cross-platform VLA architecture that supports zero-shot transfer. The approach demonstrates, for the first time, that an end-to-end learned system can perform long-horizon, dexterous manipulation tasks (such as cleaning a kitchen or bedroom) in entirely unseen homes, yielding substantial improvements in open-world generalization and moving end-to-end embodied intelligence closer to practical deployment.

📝 Abstract
In order for robots to be useful, they must perform practically relevant tasks in the real world, outside of the lab. While vision-language-action (VLA) models have demonstrated impressive results for end-to-end robot control, it remains an open question how far such models can generalize in the wild. We describe $\pi_{0.5}$, a new model based on $\pi_0$ that uses co-training on heterogeneous tasks to enable broad generalization. $\pi_{0.5}$ uses data from multiple robots, high-level semantic prediction, web data, and other sources to enable broadly generalizable real-world robotic manipulation. Our system uses a combination of co-training and hybrid multi-modal examples that combine image observations, language commands, object detections, semantic subtask prediction, and low-level actions. Our experiments show that this kind of knowledge transfer is essential for effective generalization, and we demonstrate for the first time that an end-to-end learning-enabled robotic system can perform long-horizon and dexterous manipulation skills, such as cleaning a kitchen or bedroom, in entirely new homes.
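The co-training recipe the abstract describes mixes heterogeneous example types (web vision-language data, object detection, semantic subtask prediction, low-level robot actions) into each training batch. A minimal sketch of such a weighted data mixture is below; the source names and mixture weights are illustrative assumptions, not the paper's actual configuration.

```python
import random

# Hypothetical co-training mixture: each batch draws example types
# according to fixed weights. Names and weights are placeholders.
SOURCES = {
    "web_vision_language": 0.4,
    "object_detection": 0.1,
    "subtask_prediction": 0.2,
    "robot_actions": 0.3,
}

def sample_batch(batch_size: int, rng: random.Random) -> list[str]:
    """Sample which data source each example in a batch comes from."""
    names = list(SOURCES)
    weights = [SOURCES[n] for n in names]
    return rng.choices(names, weights=weights, k=batch_size)

batch = sample_batch(8, random.Random(0))
```

In practice such a sampler would index into per-source datasets; the point here is only that all task types share one model and one batch stream.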
Problem

Research questions and friction points this paper is trying to address.

Enabling robots to perform real-world tasks outside labs
Improving vision-language-action models' open-world generalization
Achieving long-horizon manipulation in unseen environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Co-training on heterogeneous tasks for generalization
Multi-modal data integration for robotic manipulation
End-to-end learning for long-horizon manipulation skills
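The hierarchy implied above (high-level semantic subtask prediction feeding low-level action generation) can be sketched as a two-stage control loop. All names and shapes here are illustrative assumptions, not the paper's API; the action chunk is a placeholder of zero joint deltas.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    image: bytes   # camera frame (placeholder)
    command: str   # high-level language command

def predict_subtask(obs: Observation) -> str:
    # Stand-in for the high-level semantic prediction head,
    # e.g. "clean the kitchen" -> "pick up the sponge".
    return f"subtask for: {obs.command}"

def predict_actions(obs: Observation, subtask: str, horizon: int = 4) -> list[list[float]]:
    # Stand-in for the low-level action expert: returns a chunk of
    # `horizon` actions, each a 7-DoF vector (placeholder zeros).
    return [[0.0] * 7 for _ in range(horizon)]

def control_step(obs: Observation) -> list[list[float]]:
    subtask = predict_subtask(obs)        # stage 1: language subtask
    return predict_actions(obs, subtask)  # stage 2: action chunk

actions = control_step(Observation(image=b"", command="clean the kitchen"))
```

A single model can serve both stages; the split is in the output representation, not in separate networks.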
👥 Authors
Kevin Black
Noah Brown
James Darpinian
Karan Dhabalia
Danny Driess (Google DeepMind; Machine Learning, Robotics)
Adnan Esmail
Michael Equi (UC Berkeley; Machine Learning, Robot Learning)
Chelsea Finn (Stanford University, Physical Intelligence; Machine Learning, Robotics, Reinforcement Learning)
Niccolo Fusai
Manuel Y. Galliker
Dibya Ghosh (UC Berkeley; Machine Learning, Reinforcement Learning, Optimization)
Lachy Groom
Karol Hausman (Physical Intelligence, Stanford; Machine Learning, Robotics, Reinforcement Learning)
Brian Ichter (Physical Intelligence; Robotics, Machine Learning, Foundation Models)
Szymon Jakubczak
Tim Jones
Liyiming Ke (Physical Intelligence)
Devin LeBlanc
Sergey Levine (UC Berkeley, Physical Intelligence; Machine Learning, Robotics, Reinforcement Learning)
Adrian Li-Bell
Mohith Mothukuri
Suraj Nair
Karl Pertsch (UC Berkeley, Stanford University; Artificial Intelligence, Machine Learning, Robotics)
Allen Z. Ren (Physical Intelligence; Robotics, Machine Learning)
Lucy Xiaoyang Shi (Stanford University, Physical Intelligence; Machine Learning, Robotics, Reinforcement Learning)
Laura Smith
Jost Tobias Springenberg (Google DeepMind; Machine Learning)
Kyle Stachowicz (UC Berkeley; Reinforcement Learning, Learning-based Control, Robotics)
James Tanner
Quan Vuong (Physical Intelligence; Reinforcement Learning, Computer Vision)
H. Walke
Anna Walling
Haohuan Wang
Lili Yu (Meta AI; Natural Language Processing, Machine Learning)
Ury Zhilinsky