🤖 AI Summary
This work investigates whether instance-level 6D object pose estimation from a single RGB image can serve as the sole perceptual input for robotic grasping. To this end, we establish the first systematic evaluation framework tailored to physics-based simulated grasping tasks, integrating both parallel-jaw grippers and underactuated robotic hands within Gazebo. Grasps are executed in 3D in closed loop using the estimated poses, and five state-of-the-art open-source pose estimators are benchmarked on a subset of the BOP dataset. Unlike conventional offline pose-accuracy metrics, our approach directly links pose estimation performance to task-level success, measured as grasp success rate. Experimental results show that several purely vision-based methods achieve over 70% stable grasp success in simulation, demonstrating their viability as lightweight, cost-effective perception solutions for grasping. This study bridges a critical gap between visual pose estimation and end-to-end robotic manipulation evaluation, providing a reproducible benchmark and principled algorithm-selection guidance for vision-driven grasping systems.
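To make the task-level metric concrete, the scoring idea can be sketched as follows: each estimator is credited with the fraction of simulated grasp trials that end in a stable grasp, rather than with an offline pose-accuracy score. This is a minimal illustrative sketch; the estimator names and trial outcomes below are hypothetical placeholders, not results from the paper.

```python
# Hypothetical sketch: ranking pose estimators by grasp success rate,
# i.e. the fraction of simulated grasp trials that end in a stable grasp.
# Estimator names and outcomes are illustrative, not actual results.

def grasp_success_rate(trial_outcomes):
    """Fraction of grasp trials that succeeded (True = stable grasp)."""
    return sum(trial_outcomes) / len(trial_outcomes)

# One boolean per simulated trial: True if the object was lifted and held.
trials_by_estimator = {
    "estimator_A": [True, True, False, True, True],
    "estimator_B": [True, False, False, True, True],
}

for name, outcomes in sorted(
    trials_by_estimator.items(),
    key=lambda item: grasp_success_rate(item[1]),
    reverse=True,
):
    print(f"{name}: {grasp_success_rate(outcomes):.0%}")
```

The key design point is that the metric is binary per trial and averaged over many trials, so a pose error only matters to the extent that it actually causes a grasp to fail in the physics simulation.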
📝 Abstract
We present a framework for evaluating 6-DoF instance-level object pose estimators, focusing on those that require a single RGB (not RGB-D) image as input. Besides gaining intuition about how accurate these estimators are, we are interested in the degree to which they can serve as the sole perception mechanism for robotic grasping. To assess this, we perform grasping trials in a physics-based simulator, using image-based pose estimates to guide a parallel gripper and an underactuated robotic hand in picking up 3D models of objects. Our experiments on a subset of the BOP (Benchmark for 6D Object Pose Estimation) dataset compare five open-source object pose estimators and provide insights that were missing from the literature.