🤖 AI Summary
In unstructured bin-picking, monocular RGB-based local geometric modeling suffers from poor accuracy and limited generalization due to the absence of depth sensors and CAD models. Method: This paper proposes a superquadric (SQ)-driven grasping framework that requires neither depth sensors nor CAD priors. It first estimates a dense point cloud from a single RGB image using foundation models; it then introduces a global-local collaborative SQ fitting network to infer physically interpretable local geometric primitives; finally, it employs an SQ-guided grasp sampling strategy to generate stable, feasible 6D grasps from a single viewpoint. The method integrates metric depth estimation, cross-platform synthetic data generation, and end-to-end optimization. Results: Evaluated on real robotic hardware, the approach achieves a 92% grasp success rate and demonstrates significantly improved robustness and generalization to unknown shapes, severe occlusions, and textureless objects.
📝 Abstract
Bin picking is a challenging robotic task due to occlusions and physical constraints that limit the visual information available for object recognition and grasping. Existing approaches often rely on known CAD models or prior object geometries, restricting generalization to novel or unknown objects. Other methods directly regress grasp poses from RGB-D data without object priors, but the inherent noise in depth sensing and the lack of object understanding make grasp synthesis and evaluation more difficult. Superquadrics (SQ) offer a compact, interpretable shape representation that supports physical reasoning about objects and their graspability. However, recovering them from limited viewpoints is challenging, as existing methods rely on multiple perspectives for near-complete point cloud reconstruction, limiting their effectiveness in bin picking. To address these challenges, we propose **RGBSQGrasp**, a grasping framework that leverages superquadric shape primitives and foundation metric depth estimation models to infer grasp poses from a monocular RGB camera, eliminating the need for depth sensors. Our framework integrates a universal, cross-platform dataset generation pipeline, a foundation model-based object point cloud estimation module, a global-local superquadric fitting network, and an SQ-guided grasp pose sampling module. By integrating these components, RGBSQGrasp reliably infers grasp poses through geometric reasoning, enhancing grasp stability and adaptability to unseen objects. Real-world robotic experiments demonstrate a 92% grasp success rate, highlighting the effectiveness of RGBSQGrasp in packed bin-picking environments.
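For readers unfamiliar with the representation, a superquadric is defined by just five shape parameters: three axis scales (a1, a2, a3) and two shape exponents (ε1, ε2). The standard inside-outside function below (general background, not code from the paper) shows why SQs are compact and interpretable; the function evaluates to 1 on the surface, below 1 inside, and above 1 outside, which is what makes them convenient for fitting and for geometric grasp reasoning:

```python
import numpy as np

def sq_inside_outside(points, a, eps):
    """Superquadric inside-outside function F(x, y, z).

    F < 1 inside the surface, F == 1 on it, F > 1 outside.
    points: (N, 3) array in the superquadric's local frame.
    a:      (a1, a2, a3) axis scales.
    eps:    (eps1, eps2) shape exponents; (1, 1) gives an ellipsoid,
            values near 0 give box-like shapes, eps near 2 gives
            diamond-like cross-sections.
    """
    a1, a2, a3 = a
    e1, e2 = eps
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    # Cross-section term in the x-y plane, re-exponentiated for elongation.
    xy = (np.abs(x / a1) ** (2.0 / e2) + np.abs(y / a2) ** (2.0 / e2)) ** (e2 / e1)
    return xy + np.abs(z / a3) ** (2.0 / e1)

# A unit sphere: a = (1, 1, 1), eps = (1, 1).
pts = np.array([[1.0, 0.0, 0.0],   # on the surface  -> F = 1
                [0.5, 0.0, 0.0],   # inside          -> F < 1
                [2.0, 0.0, 0.0]])  # outside         -> F > 1
print(sq_inside_outside(pts, (1.0, 1.0, 1.0), (1.0, 1.0)))
```

Fitting an SQ to a partial point cloud amounts to finding the pose and these five parameters that drive F toward 1 for all observed points; the paper's contribution is doing this from a single RGB view via a learned global-local fitting network rather than classical multi-view optimization.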