🤖 AI Summary
Existing mobile manipulation frameworks decouple navigation from manipulation, leading to task failures due to misaligned approach angles and limiting generalization. This work proposes an object-centric, orientation-robust manipulation paradigm—the first to integrate the SAM2 foundation model into mobile manipulation—unifying orientation-aware promptable segmentation with manipulation semantics for cross-view task understanding and execution. Methodologically, we design an end-to-end policy comprising: (i) SAM2-based orientation-aware visual segmentation; (ii) orientation-conditioned imitation learning; and (iii) action-sequence modeling, trained on a custom-built mobile manipulation platform. Experiments on multi-angle pick-and-place tasks demonstrate that our approach significantly outperforms Action Chunking Transformer in both generalization and robustness, particularly under viewpoint variation. This establishes a novel, scalable paradigm for general-purpose mobile manipulation robots.
📝 Abstract
Imitation learning for mobile manipulation is a key challenge in the field of robotic manipulation. However, current mobile manipulation frameworks typically decouple navigation and manipulation, executing manipulation only after reaching a certain location. This can lead to performance degradation when navigation is imprecise, especially due to misalignment in approach angles. To enable a mobile manipulator to perform the same task from diverse orientations, an essential capability for building general-purpose robotic models, we propose an object-centric method based on SAM2, a foundation model towards solving promptable visual segmentation in images, which incorporates manipulation orientation information into our model. Our approach enables consistent understanding of the same task from different orientations. We deploy the model on a custom-built mobile manipulator and evaluate it on a pick-and-place task under varied orientation angles. Compared to Action Chunking Transformer, our model maintains superior generalization when trained with demonstrations from varied approach angles. This work significantly enhances the generalization and robustness of imitation learning-based mobile manipulation systems.