🤖 AI Summary
To address the limited generalization capability of universal robotic manipulation agents in real-world scenarios—particularly when encountering novel objects, unseen categories, or unknown backgrounds—this paper proposes a language-guided segmentation mask injection paradigm leveraging internet-scale vision foundation models. Our method integrates language- and reasoning-driven segmentation masks (e.g., outputs from SAM or LLaVA) as strong semantic-geometric-temporal priors directly into an end-to-end policy network—a first-of-its-kind integration. We further design a dual-stream 2D convolutional imitation learning architecture to enable local-global collaborative perception and cross-modal alignment across vision, language, and action. Evaluated on a Franka Emika physical robot platform, our approach achieves significant improvements in cross-object, cross-category, and cross-background manipulation generalization and robustness using only a few demonstrations, outperforming image-only baseline policies.
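To make the mask-injection idea concrete, the sketch below shows one plausible way to turn a language query into an object mask with off-the-shelf foundation models. The grounding helper `ground_text_to_box` is a hypothetical placeholder for any open-vocabulary detector (e.g., Grounding DINO); the segmentation step uses the public `segment_anything` API. This is an illustrative assumption, not the paper's exact pipeline.

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Load a SAM checkpoint once and reuse the predictor across frames.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)

def ground_text_to_box(image_rgb: np.ndarray, query: str) -> np.ndarray:
    """Hypothetical placeholder: an open-vocabulary detector (e.g., Grounding
    DINO) would return an (x0, y0, x1, y1) box for the referred object."""
    raise NotImplementedError("plug in an open-vocabulary grounding model here")

def language_to_mask(image_rgb: np.ndarray, query: str) -> np.ndarray:
    """Turn a natural-language query into a binary mask of the target object."""
    box = ground_text_to_box(image_rgb, query)   # language -> 2D box prompt
    predictor.set_image(image_rgb)               # RGB uint8 image, HxWx3
    masks, scores, _ = predictor.predict(box=box, multimask_output=False)
    return masks[0]                              # boolean HxW mask
```

The returned mask is what gets injected into the policy alongside the raw image at every timestep.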
📝 Abstract
Improving the generalization of general-purpose robotic manipulation agents in the real world has long been a significant challenge. Existing approaches often rely on collecting large-scale robot data, such as the RT-1 dataset, which is costly and time-consuming. Moreover, because such data lacks diversity, these approaches remain limited in open-domain scenarios with novel objects and varied environments. In this paper, we propose a novel paradigm that conditions robot manipulation on language- and reasoning-driven segmentation masks generated by internet-scale foundation models. By injecting the mask modality, which carries semantic, geometric, and temporal correlation priors derived from vision foundation models, into an end-to-end policy model, our approach perceives object pose effectively and robustly and enables sample-efficient generalization to new object instances, semantic categories, and unseen backgrounds. We first introduce a series of foundation models to ground natural-language task instructions across multiple tasks. We then develop a two-stream 2D policy model based on imitation learning, which processes raw images and object masks to predict robot actions in a local-global perception manner. Extensive real-world experiments on a Franka Emika robot arm demonstrate the effectiveness of the proposed paradigm and policy architecture. Demos can be found in our submitted video, and more comprehensive ones are available at link1 or link2.
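As a rough illustration of the two-stream policy described above, here is a minimal PyTorch sketch with assumed layer sizes and a 7-dimensional action (e.g., a 6-DoF end-effector delta plus gripper command). It shows only the general image-plus-mask fusion pattern trained by behavior cloning, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class TwoStreamPolicy(nn.Module):
    """Minimal sketch of a two-stream 2D policy: one CNN encodes the raw RGB
    image (global context), a second CNN encodes the object mask (local
    geometry), and an MLP head fuses both streams to regress an action."""

    def __init__(self, action_dim: int = 7):
        super().__init__()

        def encoder(in_ch: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(in_ch, 32, 5, stride=2, padding=2), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )

        self.image_stream = encoder(3)   # global appearance stream
        self.mask_stream = encoder(1)    # local object-mask stream
        self.head = nn.Sequential(
            nn.Linear(128 + 128, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, image: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.image_stream(image), self.mask_stream(mask)], dim=-1)
        return self.head(fused)

# One behavior-cloning step on a demonstration batch (image, mask, expert action).
policy = TwoStreamPolicy()
image = torch.rand(8, 3, 128, 128)
mask = torch.rand(8, 1, 128, 128)
expert_action = torch.rand(8, 7)
loss = nn.functional.mse_loss(policy(image, mask), expert_action)
loss.backward()
```

Keeping the mask in a separate stream, rather than concatenating it as an extra image channel, is one simple way to realize the local-global perception described above; the actual fusion scheme in the paper may differ.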