Zero-Shot Peg Insertion: Identifying Mating Holes and Estimating SE(2) Poses with Vision-Language Models

📅 2025-03-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses zero-shot peg-in-hole insertion—robust assembly of unseen peg-hole pairs without task-specific training. We propose the first end-to-end solution leveraging vision-language models (VLMs), specifically CLIP-based architectures, to perform zero-shot hole detection, cross-domain matching discrimination, and SE(2) pose regression, integrated with closed-loop robotic control for insertion. Our key contributions are: (i) the first application of VLMs to zero-shot peg-in-hole, eliminating reliance on geometric priors or paired annotations; and (ii) strong generalization across heterogeneous domains—including 3D-printed parts, toys, and industrial connectors. Experiments demonstrate 90.2% matching accuracy on unseen peg-hole test sets and an 88.3% successful insertion rate on real PC backplane connectors, significantly outperforming supervised baselines.
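The cross-domain matching step, as summarized, selects the correct mating hole among candidates using CLIP-style features. A minimal sketch of that selection logic, with small placeholder vectors standing in for real CLIP embeddings (the function name and dimensions are illustrative, not from the paper):

```python
import numpy as np

def pick_mating_hole(peg_emb, hole_embs):
    """Return the index of the candidate hole most similar to the peg.

    peg_emb:   (D,) feature vector for the peg image crop
    hole_embs: (N, D) feature vectors for N candidate hole crops
    Uses cosine similarity, as is standard with CLIP features.
    """
    peg = peg_emb / np.linalg.norm(peg_emb)
    holes = hole_embs / np.linalg.norm(hole_embs, axis=1, keepdims=True)
    scores = holes @ peg  # (N,) cosine similarities
    return int(np.argmax(scores)), scores

# Placeholder embeddings (D = 4 for brevity; real CLIP features are ~512-D).
peg = np.array([1.0, 0.0, 0.0, 0.0])
holes = np.array([
    [0.0, 1.0, 0.0, 0.0],  # non-mating hole
    [0.9, 0.1, 0.0, 0.0],  # mating hole (most peg-like)
    [0.0, 0.0, 1.0, 0.0],  # non-mating hole
])
idx, scores = pick_mating_hole(peg, holes)
```

In the paper's actual pipeline the embeddings come from a CLIP-based VLM over detected hole regions; this sketch only shows the nearest-neighbor discrimination step.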

📝 Abstract
Achieving zero-shot peg insertion, i.e., inserting an arbitrary peg into an unseen hole without task-specific training, remains a fundamental challenge in robotics. This task demands a highly generalizable perception system capable of detecting potential holes, selecting the correct mating hole from multiple candidates, estimating its precise pose, and executing insertion despite uncertainties. While learning-based methods have been applied to peg insertion, they often fail to generalize beyond the specific peg-hole pairs encountered during training. Recent advancements in Vision-Language Models (VLMs) offer a promising alternative, leveraging large-scale datasets to enable robust generalization across diverse tasks. Inspired by their success, we introduce a novel zero-shot peg insertion framework that utilizes a VLM to identify mating holes and estimate their poses without prior knowledge of their geometry. Extensive experiments demonstrate that our method achieves 90.2% accuracy, significantly outperforming baselines in identifying the correct mating hole across a wide range of previously unseen peg-hole pairs, including 3D-printed objects, toy puzzles, and industrial connectors. Furthermore, we validate the effectiveness of our approach in a real-world connector insertion task on the back panel of a PC, where our system successfully detects holes, identifies the correct mating hole, estimates its pose, and completes the insertion with a success rate of 88.3%. These results highlight the potential of VLM-driven zero-shot reasoning for enabling robust and generalizable robotic assembly.
Problem

Research questions and friction points this paper is trying to address.

Achieving zero-shot peg insertion without task-specific training.
Identifying correct mating holes and estimating their precise poses.
Generalizing across diverse peg-hole pairs using Vision-Language Models.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Vision-Language Models for hole identification.
Estimates SE(2) poses without prior geometry knowledge.
Achieves high accuracy in zero-shot peg insertion.
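The pose the system regresses lives in SE(2): a planar translation (x, y) plus an in-plane rotation θ. As background for that representation (not code from the paper), an SE(2) pose can be written as a 3×3 homogeneous matrix and applied to 2-D points:

```python
import math

def se2_matrix(x, y, theta):
    """3x3 homogeneous matrix for an SE(2) pose: translate (x, y), rotate theta."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, x],
            [s,  c, y],
            [0.0, 0.0, 1.0]]

def apply_se2(T, px, py):
    """Transform a 2-D point (px, py) by the SE(2) pose T."""
    qx = T[0][0] * px + T[0][1] * py + T[0][2]
    qy = T[1][0] * px + T[1][1] * py + T[1][2]
    return qx, qy

# Rotating (1, 0) by 90 degrees gives (0, 1); translating by (1, 2) gives (1, 3).
T = se2_matrix(1.0, 2.0, math.pi / 2)
qx, qy = apply_se2(T, 1.0, 0.0)
```

In an insertion pipeline, such a pose would align the peg's frame with the estimated hole frame before the closed-loop approach.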
Masaru Yajima
Institute of Science Tokyo, Tokyo, Japan
Kei Ota
AIRoA
Robotics, Reinforcement Learning
Asako Kanezaki
Tokyo Institute of Technology
Computer Vision, Object Recognition, Shape Matching
Rei Kawakami
Institute of Science Tokyo, Tokyo, Japan