ManipGPT: Is Affordance Segmentation by Large Vision Models Enough for Articulated Object Manipulation?

📅 2024-12-13

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Traditional affordance prediction methods for articulated object manipulation, which rely on pixel sampling or point-cloud processing, are computationally expensive and adapt poorly to diverse, dynamic environments. Method: This paper proposes ManipGPT, a part-level affordance segmentation framework built on a large pre-trained vision transformer (ViT). It adapts the ViT's in-context segmentation capability to robotic manipulation by fine-tuning on a small dataset of 9.9k simulated and real images, built to narrow the sim-to-real gap, and it pairs the predicted affordance masks with an impedance adaptation policy to carry out the manipulation. Contribution/Results: The authors report significantly improved part-level affordance segmentation and effective manipulation in both simulated and real-world settings, without the need for complex datasets or perception systems.

📝 Abstract

Visual actionable affordance has emerged as a transformative approach in robotics, focusing on perceiving interaction areas prior to manipulation. Traditional methods rely on pixel sampling to identify successful interaction samples or on processing point clouds for affordance mapping. However, these approaches are computationally intensive and struggle to adapt to diverse and dynamic environments. This paper introduces ManipGPT, a framework designed to predict optimal interaction areas for articulated objects using a large pre-trained vision transformer (ViT). We created a dataset of 9.9k simulated and real images to bridge the sim-to-real gap and enhance real-world applicability. By fine-tuning the vision transformer on this small dataset, we significantly improved part-level affordance segmentation, adapting the model's in-context segmentation capabilities to robot manipulation scenarios. This enables effective manipulation across simulated and real-world environments by generating part-level affordance masks, paired with an impedance adaptation policy, sufficiently eliminating the need for complex datasets or perception systems.
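To make the idea of a mask-guided interaction area concrete, here is a minimal sketch of how a binary part-level affordance mask could be reduced to a single candidate contact pixel. It is an illustration under stated assumptions, not the paper's procedure: the abstract does not say how a contact point is chosen from the mask, and the mask format (rows of 0/1 values), the function name `pick_contact_pixel`, and the centroid heuristic are all hypothetical.

```python
from typing import List, Optional, Tuple

Mask = List[List[int]]  # assumed format: rows of 0/1 values, 1 = affordance region


def pick_contact_pixel(mask: Mask) -> Optional[Tuple[int, int]]:
    """Return the (row, col) of the mask pixel closest to the mask centroid.

    Returns None for an empty mask. Illustrative heuristic only; this is not
    the selection rule described in the ManipGPT abstract.
    """
    pixels = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not pixels:
        return None
    mean_r = sum(r for r, _ in pixels) / len(pixels)
    mean_c = sum(c for _, c in pixels) / len(pixels)
    # Snap to an actual mask pixel so the result lies on the part, even when
    # the centroid of a non-convex region (e.g. a ring-shaped handle) falls
    # outside the mask.
    return min(pixels, key=lambda p: (p[0] - mean_r) ** 2 + (p[1] - mean_c) ** 2)


# Toy 4x5 mask standing in for a predicted handle region.
demo_mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
contact = pick_contact_pixel(demo_mask)
```

In a full system the chosen pixel would still need to be lifted to a 3D pose and handed to the controller; those steps are outside what the abstract specifies and are not sketched here.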

Problem

Research questions and friction points this paper is trying to address.

Predict optimal interaction areas for articulated objects using vision transformers

Improve part-level affordance segmentation for robot manipulation scenarios

Enable effective manipulation across simulated and real-world environments

Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tunes a large pre-trained vision transformer for part-level affordance segmentation

Pairs the predicted part-level affordance masks with an impedance adaptation policy

Uses a 9.9k-image dataset of simulated and real images to bridge the sim-to-real gap

🔎 Similar Papers

UniAff: A Unified Representation of Affordances for Tool Usage and Articulation with Vision-Language Models