🤖 AI Summary
General-purpose robotic models struggle to execute multi-stage language-conditioned tasks in unseen environments and lack zero-shot cross-embodiment generalization. This work proposes a steerable multimodal context-conditioning mechanism that combines language instructions, subgoal images, and task metadata in the prompt of a single robotic foundation model. The approach lets the model train on demonstration data, suboptimal autonomous trajectories (including failures), and non-robot data sources, enabling robust language-action alignment and cross-platform learning. Evaluated across multiple robot platforms, the model performs complex tasks without task-specific fine-tuning: it folds laundry on a robot that has not seen the task before, and it operates an espresso machine at a level comparable to specialized RL-finetuned models.
📝 Abstract
We present a new robotic foundation model, called $π_{0.7}$, that enables strong out-of-the-box performance in a wide range of scenarios. $π_{0.7}$ can follow diverse language instructions in unseen environments, including multi-stage tasks with various kitchen appliances; provide zero-shot cross-embodiment generalization, for example enabling a robot to fold laundry without having seen the task before; and perform challenging tasks such as operating an espresso machine out of the box at a level of performance that matches much more specialized RL-finetuned models. The main idea behind $π_{0.7}$ is to use diverse context conditioning during training. This conditioning information, contained in the prompt, makes it possible to steer the model precisely to perform many tasks with different strategies. The model is conditioned not just on a language command that describes what it should do, but also on additional multimodal information that describes the manner or strategy in which it should do it, including metadata about task performance and subgoal images. This enables $π_{0.7}$ to use very diverse data, including demonstrations, potentially suboptimal autonomous data (including failures), and data from non-robot sources. Our experiments evaluate $π_{0.7}$ on numerous tasks across multiple robot platforms, covering tasks that require speed and dexterity, language following, and compositional task generalization.