🤖 AI Summary
Existing visual grasping methods are typically limited to a single gripper configuration and struggle to generalize across diverse end-effectors. To address this, we propose XGrasp, a vision-based framework for real-time, gripper-aware grasp detection with zero-shot generalization to unseen grippers. Methodologically, we design a hierarchical two-stage architecture: a Grasp Point Predictor (GPP) jointly encodes scene-wide features and gripper-specific parameters to generate candidate grasp points, and an Angle-Width Predictor (AWP) refines the grasp angle and width from local patch features. We introduce cross-gripper contrastive learning and systematically augment existing datasets with multi-gripper annotations to mitigate annotation scarcity. The modular framework integrates with vision foundation models and provides a pathway toward vision-language interfaces for semantic grasp specification. Experiments demonstrate competitive grasp success rates across diverse gripper types, substantially faster inference than existing gripper-aware methods, and strong zero-shot generalization to unseen grippers, validating both efficiency and broad applicability.
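The two-stage design can be pictured with the minimal PyTorch-style sketch below. It is illustrative only: the module names, layer choices, tensor shapes, and the gripper parameterization are assumptions for exposition, not the released XGrasp implementation.

```python
import torch
import torch.nn as nn

class GraspPointPredictor(nn.Module):
    """Stage 1 (illustrative): fuse global scene features with a gripper-parameter
    embedding to score candidate grasp points. Layers and shapes are assumptions."""
    def __init__(self, feat_dim=256, gripper_dim=8):
        super().__init__()
        self.backbone = nn.Conv2d(3, feat_dim, kernel_size=3, padding=1)   # stand-in for a real image encoder
        self.gripper_mlp = nn.Sequential(nn.Linear(gripper_dim, feat_dim), nn.ReLU())
        self.head = nn.Conv2d(feat_dim, 1, kernel_size=1)                   # per-pixel grasp-point score

    def forward(self, image, gripper_params):
        feats = self.backbone(image)                                        # (B, C, H, W) scene features
        g = self.gripper_mlp(gripper_params)[:, :, None, None]              # (B, C, 1, 1) gripper embedding
        return self.head(feats + g)                                         # (B, 1, H, W) grasp-point heatmap

class AngleWidthPredictor(nn.Module):
    """Stage 2 (illustrative): refine grasp angle and width from a local patch
    around each candidate point, conditioned on the same gripper embedding."""
    def __init__(self, feat_dim=256, gripper_dim=8, patch=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * patch * patch, feat_dim), nn.ReLU())
        self.gripper_mlp = nn.Sequential(nn.Linear(gripper_dim, feat_dim), nn.ReLU())
        self.head = nn.Linear(feat_dim, 2)                                   # predicts (angle, width)

    def forward(self, patch, gripper_params):
        z = self.encoder(patch) + self.gripper_mlp(gripper_params)           # fused local + gripper features
        return self.head(z)
```

Under this reading, a grasp would be assembled by taking high-scoring heatmap locations from the first stage and attaching the angle and width predicted for the active gripper in the second stage.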
📝 Abstract
Most robotic grasping methods are designed for a single gripper type, which limits their applicability in real-world scenarios that require diverse end-effectors. We propose XGrasp, a real-time gripper-aware grasp detection framework that efficiently handles multiple gripper configurations. The proposed method addresses data scarcity by systematically augmenting existing datasets with multi-gripper annotations. XGrasp employs a hierarchical two-stage architecture. In the first stage, a Grasp Point Predictor (GPP) identifies optimal grasp locations using global scene information and gripper specifications. In the second stage, an Angle-Width Predictor (AWP) refines the grasp angle and width using local features. Contrastive learning in the AWP module enables zero-shot generalization to unseen grippers by learning fundamental grasping characteristics. The modular framework integrates seamlessly with vision foundation models, providing pathways for future vision-language capabilities. Experimental results demonstrate competitive grasp success rates across various gripper types, while achieving substantial improvements in inference speed over existing gripper-aware methods. Project page: https://sites.google.com/view/xgrasp
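As a rough illustration of the cross-gripper contrastive idea in the AWP module, the sketch below uses an InfoNCE-style objective that pulls together patch embeddings of the same grasp location seen under different grippers and pushes apart the remaining pairs in the batch. The loss form, batch construction, and temperature are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def cross_gripper_contrastive_loss(anchor_emb, positive_emb, temperature=0.07):
    """Illustrative InfoNCE-style loss (assumed form, not the paper's exact objective).
    anchor_emb:   (B, D) patch embeddings for grasps under gripper A
    positive_emb: (B, D) embeddings of the same grasp locations under gripper B
    Non-matching items within the batch act as negatives."""
    a = F.normalize(anchor_emb, dim=1)
    p = F.normalize(positive_emb, dim=1)
    logits = a @ p.t() / temperature                          # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)        # diagonal entries are the matching pairs
    return F.cross_entropy(logits, targets)
```

The intent of such an objective is that the learned patch representation captures gripper-agnostic grasping characteristics, which is what would allow the angle-width head to transfer zero-shot to grippers unseen during training.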