🤖 AI Summary
Robotic grasping and placement in retail environments, such as convenience stores, are challenged by densely packed objects, severe occlusions, and high visual diversity.
Method: This paper proposes an annotation-guided perception-action framework in which bounding-box annotations indicate both graspable objects and target placement locations, providing structured spatial guidance. Building on the Action Chunking with Transformers (ACT) architecture, the framework learns, end-to-end, the mapping from annotated images to chunked continuous action sequences, without requiring explicit 3D reconstruction or physics simulation. It relies solely on human demonstration data and lightweight visual cues.
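The paper does not include code; the following is a minimal sketch of what annotation-guided visual prompting could look like, assuming OpenCV-style `(x, y, w, h)` boxes. The function name, box coordinates, and color scheme are illustrative assumptions, not details from the paper.

```python
# Sketch: overlay bounding-box prompts on the camera image before it is fed to the policy.
import cv2
import numpy as np


def annotate_observation(image: np.ndarray,
                         pick_box: tuple,
                         place_box: tuple) -> np.ndarray:
    """Draw boxes marking the pickable object and the placement location,
    giving the policy structured spatial guidance directly in pixel space."""
    prompted = image.copy()
    for (x, y, w, h), color in ((pick_box, (0, 255, 0)),    # green: object to pick
                                (place_box, (0, 0, 255))):  # red: where to place it
        cv2.rectangle(prompted, (x, y), (x + w, y + h), color, thickness=2)
    return prompted


# Usage: prompt a frame with hypothetical boxes (coordinates are made up for illustration).
frame = np.zeros((480, 640, 3), dtype=np.uint8)
obs = annotate_observation(frame, pick_box=(200, 150, 60, 90), place_box=(450, 300, 80, 80))
```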
Results: Experiments in real-world retail settings demonstrate a significant improvement in grasping success rate, along with improved operational fluency and robustness compared with baseline methods. These results validate the effectiveness of co-designing annotation-guided visual prompting with behavior cloning.
📝 Abstract
Robotic pick-and-place tasks in convenience stores pose challenges due to dense object arrangements, occlusions, and variations in object properties such as color, shape, size, and texture. These factors complicate trajectory planning and grasping. This paper introduces a perception-action pipeline leveraging annotation-guided visual prompting, where bounding box annotations identify both pickable objects and placement locations, providing structured spatial guidance. Instead of traditional step-by-step planning, we employ Action Chunking with Transformers (ACT) as an imitation learning algorithm, enabling the robotic arm to predict chunked action sequences from human demonstrations. This facilitates smooth, adaptive, and data-driven pick-and-place operations. We evaluate our system based on success rate and visual analysis of grasping behavior, demonstrating improved grasp accuracy and adaptability in retail environments.
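As a rough illustration of the chunked prediction idea, the sketch below shows a policy that maps one annotated observation to a whole chunk of actions in the spirit of ACT. The encoder, dimensions, and omission of ACT's CVAE latent and training details are simplifying assumptions; this is not the authors' implementation.

```python
# Sketch: predict a chunk of future actions from a single prompted RGB observation.
import torch
import torch.nn as nn


class ChunkedPolicy(nn.Module):
    def __init__(self, feat_dim=512, action_dim=7, chunk_size=20, n_layers=4, n_heads=8):
        super().__init__()
        # Stand-in visual encoder (ACT uses a ResNet backbone; this is a toy substitute).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim))
        decoder_layer = nn.TransformerDecoderLayer(feat_dim, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, n_layers)
        self.queries = nn.Parameter(torch.randn(chunk_size, feat_dim))  # one query per future step
        self.head = nn.Linear(feat_dim, action_dim)                      # e.g. joint targets per step

    def forward(self, prompted_image: torch.Tensor) -> torch.Tensor:
        """Map an annotated observation (B, 3, H, W) to an action chunk (B, chunk_size, action_dim)."""
        memory = self.backbone(prompted_image).unsqueeze(1)              # (B, 1, feat_dim)
        tgt = self.queries.unsqueeze(0).expand(memory.size(0), -1, -1)   # (B, chunk_size, feat_dim)
        return self.head(self.decoder(tgt, memory))


# Usage: one forward pass yields the next 20 actions, which the arm can execute as a chunk.
policy = ChunkedPolicy()
actions = policy(torch.randn(1, 3, 224, 224))
```

Training such a policy by behavior cloning would regress these chunks against human demonstration trajectories, which is the imitation-learning setup the abstract describes.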