ManipGPT: Is Affordance Segmentation by Large Vision Models Enough for Articulated Object Manipulation?

📅 2024-12-13
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional affordance prediction methods for manipulating articulated objects suffer from high computational cost, poor generalization, and limited adaptability to dynamic environments. Method: This paper proposes a lightweight, large-vision-model-driven framework for part-level affordance segmentation. It adapts a pre-trained ViT's in-context segmentation capability to robotic manipulation tasks; constructs a compact cross-domain (simulation + real) dataset of 9.9k images to mitigate the sim-to-real gap; and decouples perception from control, using the predicted affordance masks to guide an impedance adaptation policy for manipulation. Contribution/Results: Experiments demonstrate robust part-level segmentation in both simulated and real-world settings while significantly reducing perception overhead. The approach removes the reliance on complex point-cloud processing and large-scale manual annotation, enabling efficient, low-resource manipulation across diverse articulated objects.

📝 Abstract
Visual actionable affordance has emerged as a transformative approach in robotics, focusing on perceiving interaction areas prior to manipulation. Traditional methods rely on pixel sampling to identify successful interaction samples or on processing point clouds for affordance mapping. However, these approaches are computationally intensive and struggle to adapt to diverse and dynamic environments. This paper introduces ManipGPT, a framework designed to predict optimal interaction areas for articulated objects using a large pre-trained vision transformer (ViT). We created a dataset of 9.9k simulated and real images to bridge the sim-to-real gap and enhance real-world applicability. By fine-tuning the vision transformer on this small dataset, we significantly improved part-level affordance segmentation, adapting the model's in-context segmentation capabilities to robot manipulation scenarios. This enables effective manipulation across simulated and real-world environments by generating part-level affordance masks paired with an impedance adaptation policy, effectively eliminating the need for complex datasets or perception systems.
Problem

Research questions and friction points this paper is trying to address.

Predict optimal interaction areas for articulated objects using vision transformers
Improve part-level affordance segmentation for robot manipulation scenarios
Enable effective manipulation across simulated and real-world environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tunes vision transformer for affordance segmentation
Generates part-level masks with impedance adaptation policy
Uses sim-to-real dataset to bridge visual gap
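The decoupled pipeline implied by these points can be sketched as: a segmentation model produces a part-level affordance mask, a contact point is chosen from the mask, and a separate impedance policy sets compliance during execution. The sketch below is illustrative only; the function names, the stubbed mask predictor, and the gain rule are assumptions, not the paper's implementation.

```python
import numpy as np

def predict_affordance_mask(rgb: np.ndarray) -> np.ndarray:
    """Stand-in for the fine-tuned ViT: returns a binary part-level
    affordance mask with the input image's height and width.
    (Here a fixed rectangular region; the real model is learned.)"""
    h, w, _ = rgb.shape
    mask = np.zeros((h, w), dtype=bool)
    mask[h // 4 : h // 2, w // 4 : w // 2] = True  # pretend "handle" region
    return mask

def select_contact_point(mask: np.ndarray) -> tuple:
    """Pick a contact pixel from the mask, e.g. its centroid."""
    ys, xs = np.nonzero(mask)
    return int(ys.mean()), int(xs.mean())

def impedance_gain(contact_error: float, k_min=50.0, k_max=500.0) -> float:
    """Toy impedance adaptation rule: stiffen when tracking error is
    small, soften on large error to stay compliant (illustrative only)."""
    return float(np.clip(k_max / (1.0 + 10.0 * contact_error), k_min, k_max))

# Usage: perception (mask prediction) is decoupled from control
# (impedance adaptation), so either half can be swapped independently.
img = np.zeros((64, 64, 3), dtype=np.uint8)
mask = predict_affordance_mask(img)
cy, cx = select_contact_point(mask)
k = impedance_gain(contact_error=0.02)
```

The design point this illustrates is that the controller consumes only a 2D mask, so no point-cloud processing or per-object perception stack sits between the vision model and the policy.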
Authors

Taewhan Kim (Seoul National University, Department of Electrical and Computer Engineering)
Hojin Bae (CFCS, School of CS, Peking University, Beijing, 100091, China)
Zeming Li (Hong Kong University of Science and Technology (HKUST))
Xiaoqi Li (CFCS, School of CS, Peking University, Beijing, 100091, China)
Iaroslav Ponomarenko (CFCS, School of CS, Peking University, Beijing, 100091, China)
Ruihai Wu (Peking University)
Hao Dong (CFCS, School of CS, Peking University, Beijing, 100091, China)