🤖 AI Summary
Robotic grasping and placement in retail environments, such as convenience stores, are challenged by densely packed objects, severe occlusions, and high visual diversity.
Method: This paper proposes an annotation-guided perception-action framework in which bounding-box annotations indicate both graspable objects and target placement locations, providing structured spatial guidance. Building on the Action Chunking with Transformers (ACT) architecture, the framework learns, end-to-end, the mapping from annotated images to chunked continuous action sequences, without requiring explicit 3D reconstruction or physics simulation. It relies solely on human demonstration data and lightweight visual cues.
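The paper does not include code; the following is a minimal sketch of what annotation-guided visual prompting could look like, assuming OpenCV-style `(x, y, w, h)` boxes. The function name, box coordinates, and color scheme are illustrative assumptions, not details from the paper.

```python
# Sketch: overlay bounding-box prompts on the camera image before it is fed to the policy.
import cv2
import numpy as np


def annotate_observation(image: np.ndarray,
                         pick_box: tuple,
                         place_box: tuple) -> np.ndarray:
    """Draw boxes marking the pickable object and the placement location,
    giving the policy structured spatial guidance directly in pixel space."""
    prompted = image.copy()
    for (x, y, w, h), color in ((pick_box, (0, 255, 0)),    # green: object to pick
                                (place_box, (0, 0, 255))):  # red: where to place it
        cv2.rectangle(prompted, (x, y), (x + w, y + h), color, thickness=2)
    return prompted


# Usage: prompt a frame with hypothetical boxes (coordinates are made up for illustration).
frame = np.zeros((480, 640, 3), dtype=np.uint8)
obs = annotate_observation(frame, pick_box=(200, 150, 60, 90), place_box=(450, 300, 80, 80))
```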
Results: Experiments in real-world retail settings demonstrate a significant improvement in grasping success rate, along with improved operational fluency and robustness compared with baseline methods. These results validate the effectiveness of co-designing annotation-guided visual prompting with behavior cloning.
📝 Abstract
Robotic pick-and-place tasks in convenience stores pose challenges due to dense object arrangements, occlusions, and variations in object properties such as color, shape, size, and texture. These factors complicate trajectory planning and grasping. This paper introduces a perception-action pipeline leveraging annotation-guided visual prompting, where bounding box annotations identify both pickable objects and placement locations, providing structured spatial guidance. Instead of traditional step-by-step planning, we employ Action Chunking with Transformers (ACT) as an imitation learning algorithm, enabling the robotic arm to predict chunked action sequences from human demonstrations. This facilitates smooth, adaptive, and data-driven pick-and-place operations. We evaluate our system based on success rate and visual analysis of grasping behavior, demonstrating improved grasp accuracy and adaptability in retail environments.
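As a rough illustration of the chunked prediction idea, the sketch below shows a policy that maps one annotated observation to a whole chunk of actions in the spirit of ACT. The encoder, dimensions, and omission of ACT's CVAE latent and training details are simplifying assumptions; this is not the authors' implementation.

```python
# Sketch: predict a chunk of future actions from a single prompted RGB observation.
import torch
import torch.nn as nn


class ChunkedPolicy(nn.Module):
    def __init__(self, feat_dim=512, action_dim=7, chunk_size=20, n_layers=4, n_heads=8):
        super().__init__()
        # Stand-in visual encoder (ACT uses a ResNet backbone; this is a toy substitute).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim))
        decoder_layer = nn.TransformerDecoderLayer(feat_dim, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, n_layers)
        self.queries = nn.Parameter(torch.randn(chunk_size, feat_dim))  # one query per future step
        self.head = nn.Linear(feat_dim, action_dim)                      # e.g. joint targets per step

    def forward(self, prompted_image: torch.Tensor) -> torch.Tensor:
        """Map an annotated observation (B, 3, H, W) to an action chunk (B, chunk_size, action_dim)."""
        memory = self.backbone(prompted_image).unsqueeze(1)              # (B, 1, feat_dim)
        tgt = self.queries.unsqueeze(0).expand(memory.size(0), -1, -1)   # (B, chunk_size, feat_dim)
        return self.head(self.decoder(tgt, memory))


# Usage: one forward pass yields the next 20 actions, which the arm can execute as a chunk.
policy = ChunkedPolicy()
actions = policy(torch.randn(1, 3, 224, 224))
```

Training such a policy by behavior cloning would regress these chunks against human demonstration trajectories, which is the imitation-learning setup the abstract describes.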