🤖 AI Summary
This work addresses autonomous grasp target selection and accurate 6DoF pose estimation for robots operating on stacked, structured object formations (e.g., bricklaying, warehouse stacking), where objects are subject to multi-layer occlusion. We propose the first unified framework that jointly optimizes object selection and pose estimation, incorporating a hierarchical selection policy that prioritizes unoccluded top-layer objects. To support systematic evaluation, we introduce the first dedicated benchmark dataset for stacked scenes and define a composite metric that integrates selection rationality and pose accuracy. Our method builds on a tightly coupled camera–IMU perception architecture that fuses geometric priors with deep learning features for robust stack-layer parsing and 6DoF pose regression. Extensive experiments on our custom dataset demonstrate significant improvements over baseline methods. Furthermore, real-world deployment in robotic brick grasping validates the approach’s practicality and reliability under challenging conditions, including variable illumination and partial occlusion.
📝 Abstract
Vision-based robotic object grasping is typically investigated in the context of isolated objects or unstructured object sets in bin-picking scenarios. However, there are several settings, such as construction or warehouse automation, where a robot needs to interact with a structured object formation such as a stack. In this context, we define the problem of selecting suitable objects for grasping along with estimating an accurate 6DoF pose of these objects. To address this problem, we propose a camera-IMU-based approach that prioritizes unobstructed objects on the higher layers of stacks, and we introduce a dataset for benchmarking and evaluation, along with a suitable evaluation metric that combines object selection with pose accuracy. Experimental results show that our method performs well, but that the problem remains challenging when a completely error-free solution is required. Finally, we show results from the deployment of our method for a brick-picking application in a construction scenario.
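To make the idea of prioritizing unobstructed objects on the higher layers concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes that object centroids are already available in the camera frame, that the IMU supplies a gravity vector in the same frame, and that an object counts as obstructed when another centroid lies roughly one layer above it within a horizontal footprint. The function name `select_top_layer` and the parameters `layer_height` and `footprint_radius` are hypothetical.

```python
import numpy as np

def select_top_layer(centroids, gravity, layer_height, footprint_radius):
    """Return indices of objects with nothing stacked on them, highest first.

    All names and thresholds here are illustrative assumptions, not taken
    from the paper.
    """
    centroids = np.asarray(centroids, dtype=float)
    g = np.asarray(gravity, dtype=float)
    up = -g / np.linalg.norm(g)                 # unit vector opposing gravity
    heights = centroids @ up                    # centroid height along "up"
    horiz = centroids - np.outer(heights, up)   # horizontal component only

    selected = []
    for i in range(len(centroids)):
        dh = heights - heights[i]
        dist = np.linalg.norm(horiz - horiz[i], axis=1)
        # Another object is "on top" if it is clearly higher and overlaps
        # the assumed horizontal footprint of object i.
        on_top = (dh > 0.5 * layer_height) & (dist < footprint_radius)
        if not on_top.any():
            selected.append(i)

    # Prefer the highest candidates first.
    selected.sort(key=lambda i: -heights[i])
    return selected

# Two bricks stacked at the origin and one lone brick half a metre away.
bricks = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.1), (0.5, 0.0, 0.0)]
candidates = select_top_layer(bricks, gravity=(0.0, 0.0, -9.81),
                              layer_height=0.1, footprint_radius=0.2)
```

In this toy scene the lower brick of the stack is excluded because another centroid sits one layer above it, while the upper brick and the lone brick remain candidates. Using the IMU gravity direction to define "up" keeps the layer ordering independent of how the camera is tilted.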