AI Summary
This work addresses the challenge of unstable robotic grasping from a single-view image due to incomplete geometric information by proposing a two-stage grasping framework. The approach first generates initial grasp poses through superquadric-based cross-object similarity matching and then refines grasp quality via an end-to-end region-aware network, E-RNet, which evaluates and optimizes grasps anchored on the gripper's closing region. Key innovations include a superquadric-guided similarity matching mechanism for cross-object generalization and a region-aware architecture explicitly designed for grasp stability. Experimental results demonstrate that the method achieves high success rates and strong generalization across diverse unknown objects and complex environments in both simulation and real-world settings.
Abstract
Robotic grasping from single-view observations remains a critical challenge in manipulation. Existing methods still struggle to generate stable and valid grasp poses when confronted with incomplete geometric information. To address these limitations, we propose SuperGrasp, a novel two-stage framework for single-view grasping with parallel-jaw grippers that decomposes the grasping process into initial grasp pose generation and subsequent grasp evaluation and refinement. In the first stage, we introduce a Similarity Matching Module that efficiently retrieves grasp candidates by matching the input single-view point cloud with a pre-computed primitive dataset based on superquadric coefficients. In the second stage, we propose E-RNet, an end-to-end network that expands the grasp-aware region and takes the initial grasp closure region as a local anchor region, enabling more accurate and reliable evaluation and refinement of grasp candidates. To enhance generalization, we construct a primitive dataset containing 1.5k primitives for similarity matching and collect a large-scale point cloud dataset with 100k stable grasp labels from 124 objects for network training. Extensive experiments in both simulation and real-world environments demonstrate that our method achieves stable grasping performance and strong generalization across varying scenes and novel objects.
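The abstract's first stage matches an observed point cloud against a primitive dataset via superquadric coefficients. A minimal sketch of what such matching could look like is shown below, assuming the standard superquadric parameterization (scales a1, a2, a3 and shape exponents eps1, eps2) and a simple nearest-neighbor retrieval over the 5-D coefficient vector; the paper's actual fitting and matching procedure may differ.

```python
import numpy as np

def superquadric_inside_outside(points, a, eps):
    """Standard superquadric inside-outside function.

    points: (N, 3) array in the superquadric's canonical frame.
    a = (a1, a2, a3): axis scales; eps = (eps1, eps2): shape exponents.
    Returns F with F < 1 inside, F == 1 on the surface, F > 1 outside.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    a1, a2, a3 = a
    e1, e2 = eps
    xy_term = (np.abs(x / a1) ** (2.0 / e2)
               + np.abs(y / a2) ** (2.0 / e2)) ** (e2 / e1)
    return xy_term + np.abs(z / a3) ** (2.0 / e1)

def match_primitive(query_coeffs, dataset_coeffs):
    """Toy similarity matching: retrieve the index of the nearest
    primitive by Euclidean distance in (a1, a2, a3, eps1, eps2) space.
    (Illustrative only; not the paper's module.)"""
    dists = np.linalg.norm(dataset_coeffs - query_coeffs, axis=1)
    return int(np.argmin(dists))
```

For example, with a = (1, 1, 1) and eps = (1, 1) the superquadric reduces to a unit sphere, so a point on that sphere evaluates to F = 1; the matcher would then pull pre-computed grasp candidates from the retrieved primitive's entry in the dataset.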