AI Summary
Robots must accurately grasp specific object parts based on natural language instructions to enable human-robot collaboration and tool manipulation, yet existing approaches struggle with open-vocabulary part-level grasping. This paper introduces AnyPart, the first open-vocabulary part-level grasping framework, integrating GLIP (open-vocabulary object detection), MaskCLIP (part-aware segmentation), and a geometry-aware 6-DoF grasp pose regression network. The authors further construct the first manually annotated part-level segmentation dataset (1,014 samples) and a real-world part-grasping dataset. The system localizes target parts and predicts full 6-DoF grasp poses within 800 ms. Evaluated on 28 household object categories across 360 physical trials, it achieves a grasp success rate of 69.52% and a part localization accuracy of 88.57%, significantly outperforming baseline methods.
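The three-stage pipeline described above (open-vocabulary detection, then part segmentation, then 6-DoF grasp regression) can be sketched as a simple function composition. This is a minimal, hypothetical sketch of the data flow only: the function names, box/mask representations, and the centroid-based grasp stub are illustrative assumptions, not the paper's actual API or models.

```python
# Hypothetical sketch of an AnyPart-style pipeline. Each stage is a stub
# standing in for the real model (GLIP, MaskCLIP, grasp regressor); only
# the data flow (instruction -> box -> part pixels -> 6-DoF pose) is real.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Grasp:
    position: Tuple[float, float, float]            # x, y, z
    orientation: Tuple[float, float, float, float]  # quaternion (w, x, y, z)
    score: float

def detect_object(rgb, object_query: str) -> Tuple[int, int, int, int]:
    """Stand-in for open-vocabulary detection (GLIP in the paper):
    returns a bounding box (x0, y0, x1, y1) for the queried object."""
    return (10, 10, 90, 90)  # dummy box

def segment_part(rgb, box, part_query: str) -> List[Tuple[int, int]]:
    """Stand-in for part-aware segmentation (MaskCLIP in the paper):
    returns pixel coordinates of the queried part inside the box."""
    x0, y0, _, _ = box
    return [(x, y) for x in range(x0, x0 + 5) for y in range(y0, y0 + 5)]

def predict_grasp(depth, part_pixels) -> Grasp:
    """Stand-in for the 6-DoF grasp regressor: here we simply place a
    grasp at the centroid of the part pixels with identity orientation."""
    cx = sum(p[0] for p in part_pixels) / len(part_pixels)
    cy = sum(p[1] for p in part_pixels) / len(part_pixels)
    return Grasp((cx, cy, 0.0), (1.0, 0.0, 0.0, 0.0), score=1.0)

def grasp_part(rgb, depth, instruction: Tuple[str, str]) -> Grasp:
    """Full pipeline for an instruction like ("mug", "handle")."""
    object_query, part_query = instruction
    box = detect_object(rgb, object_query)
    pixels = segment_part(rgb, box, part_query)
    return predict_grasp(depth, pixels)

grasp = grasp_part(rgb=None, depth=None, instruction=("mug", "handle"))
print(grasp.position)  # centroid of the dummy part pixels
```

The point of the sketch is the interface between stages: the detector narrows the search to one object, the segmenter restricts grasp candidates to the named part, and the regressor only ever sees part pixels, which is what makes the final pose part-specific.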
Abstract
Many robotic applications require grasping objects not arbitrarily but at a very specific object part. This is especially important for manipulation tasks beyond simple pick-and-place scenarios and for robot-human interactions such as object handovers. We propose AnyPart, a practical system that combines open-vocabulary object detection, open-vocabulary part segmentation, and 6-DoF grasp pose prediction to infer a grasp pose on a specific part of an object in 800 milliseconds. We contribute two new datasets for the task of open-vocabulary part-based grasping: a hand-segmented dataset containing 1,014 object-part segmentations, and a dataset of real-world scenarios gathered during our robot trials for individual objects and table-clearing tasks. We evaluate AnyPart on a mobile manipulator robot using a set of 28 common household objects over 360 grasping trials. AnyPart produces successful grasps 69.52% of the time; when ignoring robot-based grasp failures, it predicts a grasp location on the correct part 88.57% of the time.