Visual Prompting for Robotic Manipulation with Annotation-Guided Pick-and-Place Using ACT

📅 2025-08-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Robotic grasping and placement in retail environments such as convenience stores are challenged by densely packed objects, severe occlusions, and high visual diversity. Method: The paper proposes an annotation-guided perception-action framework in which bounding-box annotations mark both graspable objects and target placement locations, providing structured spatial guidance. An Action Chunking with Transformers (ACT) policy then learns, end to end, the mapping from annotated images to continuous action sequences, without requiring explicit 3D reconstruction or physics simulation; the framework relies solely on human demonstration data and lightweight visual cues. Results: Experiments in real-world retail settings show a significant improvement in grasping success rate, along with improved operational fluency and robustness over baseline methods. These results validate the effectiveness of co-designing annotation-guided visual prompting with behavior cloning.
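As a rough illustration of the annotation-guided visual prompting idea, the sketch below overlays two bounding boxes (one marking the object to pick, one marking the placement location) on the camera image before it is passed to the policy. The function name, box colors, and coordinate format are assumptions for illustration only, not the paper's actual annotation scheme.

```python
import numpy as np
import cv2  # OpenCV, used here only to draw the annotation overlays

def add_visual_prompt(image, pick_box, place_box):
    """Overlay bounding-box annotations on the camera image.

    pick_box / place_box are (x1, y1, x2, y2) pixel coordinates marking the
    graspable object and the target placement location. The annotated image
    is what the policy consumes; colors and thickness are illustrative.
    """
    prompted = image.copy()
    cv2.rectangle(prompted, tuple(pick_box[:2]), tuple(pick_box[2:]),
                  color=(0, 255, 0), thickness=2)   # pick target
    cv2.rectangle(prompted, tuple(place_box[:2]), tuple(place_box[2:]),
                  color=(0, 0, 255), thickness=2)   # placement target
    return prompted

# Example: annotate a dummy 480x640 frame with hypothetical box coordinates.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
prompt_img = add_visual_prompt(frame, pick_box=(100, 150, 180, 230),
                               place_box=(400, 200, 480, 280))
```

The overlay acts purely as a lightweight visual cue consumed by the learned policy; no 3D reconstruction or explicit grasp planning is involved, consistent with the paper's stated reliance on annotations and demonstration data.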

📝 Abstract
Robotic pick-and-place tasks in convenience stores pose challenges due to dense object arrangements, occlusions, and variations in object properties such as color, shape, size, and texture. These factors complicate trajectory planning and grasping. This paper introduces a perception-action pipeline leveraging annotation-guided visual prompting, where bounding box annotations identify both pickable objects and placement locations, providing structured spatial guidance. Instead of traditional step-by-step planning, we employ Action Chunking with Transformers (ACT) as an imitation learning algorithm, enabling the robotic arm to predict chunked action sequences from human demonstrations. This facilitates smooth, adaptive, and data-driven pick-and-place operations. We evaluate our system based on success rate and visual analysis of grasping behavior, demonstrating improved grasp accuracy and adaptability in retail environments.
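To make the "chunked action sequences" concrete, here is a minimal sketch of ACT-style execution: at every control step the policy predicts a whole chunk of future actions, and overlapping predictions for the same timestep are combined with the temporal ensembling used in the original ACT work. The chunk size, the decay rate m, and the `policy`/`env` interfaces are hypothetical; the paper does not state whether or how it ensembles overlapping chunks.

```python
import numpy as np

CHUNK_SIZE = 20  # actions predicted per policy call; the paper's exact value is not stated here

def temporal_ensemble(chunks, t, m=0.01):
    """Combine every previously predicted action that targets timestep t.

    `chunks` maps prediction time t0 -> array of CHUNK_SIZE future actions.
    Following the original ACT recipe, contributions are weighted
    exp(-m * i) with i = 0 for the oldest prediction; m is illustrative.
    """
    actions = []
    for t0, chunk in sorted(chunks.items()):          # oldest prediction first
        offset = t - t0
        if 0 <= offset < len(chunk):
            actions.append(chunk[offset])
    weights = np.exp(-m * np.arange(len(actions)))
    return np.average(np.stack(actions), axis=0, weights=weights / weights.sum())

def run_episode(policy, env, horizon=200):
    """Control loop; `policy` and `env` are hypothetical stand-ins for the
    trained ACT model and the robot interface (env.step returns the next observation)."""
    chunks, obs = {}, env.reset()
    for t in range(horizon):
        chunks[t] = policy(obs)                       # (CHUNK_SIZE, action_dim) chunk predicted at t
        obs = env.step(temporal_ensemble(chunks, t))  # execute only the ensembled action for step t
    return obs
```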
Problem

Research questions and friction points this paper is trying to address.

Challenges in robotic pick-and-place due to dense, occluded objects
Variations in object properties complicate grasping and planning
Need for adaptive, data-driven manipulation in retail environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Annotation-guided visual prompting for spatial guidance
Action Chunking with Transformers for sequence prediction
Data-driven pick-and-place operations in retail, learned from human demonstrations (see the training sketch below)
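Because the policy is trained purely by behavior cloning on human demonstrations, a single training update reduces to regressing the demonstrated action chunk from the annotated image. The sketch below is a deliberately simplified stand-in, assuming a small CNN encoder and a plain L1 chunk-regression loss; the original ACT model additionally uses a transformer encoder-decoder with a CVAE-style objective, and the architecture details used in this paper are not reproduced here.

```python
import torch
import torch.nn as nn

class ChunkPolicy(nn.Module):
    """Hypothetical chunk-prediction policy: CNN encoder over the annotated image,
    followed by a head that outputs a full chunk of future joint commands."""
    def __init__(self, action_dim=7, chunk_size=20):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, chunk_size * action_dim)
        self.chunk_size, self.action_dim = chunk_size, action_dim

    def forward(self, annotated_image):
        feat = self.encoder(annotated_image)
        return self.head(feat).view(-1, self.chunk_size, self.action_dim)

def train_step(policy, optimizer, batch):
    """One behavior-cloning update: regress the demonstrated action chunk."""
    images, action_chunks = batch                       # (B,3,H,W), (B,chunk_size,action_dim)
    pred = policy(images)
    loss = nn.functional.l1_loss(pred, action_chunks)   # L1 reconstruction, as in the original ACT
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```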
Muhammad A. Muttaqien
Embodied AI Research Team, National Institute of AIST, Tokyo, Japan
Tomohiro Motoda
National Institute of Advanced Industrial Science and Technology (AIST)
Robotic manipulation, deep learning
Ryo Hanai
Embodied AI Research Team, National Institute of AIST, Tokyo, Japan
Yukiyasu Domae
AIST
Machine vision, Manipulation, Automation, Experiential autonomy