🤖 AI Summary
Traditional affordance prediction methods for dexterous manipulation of articulated objects suffer from high computational cost, poor generalization, and limited adaptability to dynamic environments. Method: This paper proposes ManipGPT, a part-level affordance segmentation framework built on a large pre-trained vision transformer (ViT). It adapts the ViT's in-context segmentation capability to robotic manipulation tasks; constructs a compact dataset of 9.9k simulated and real images to mitigate the sim-to-real gap; and decouples perception from control, using affordance masks to guide an impedance adaptation policy during manipulation. Contribution/Results: Experiments demonstrate robust part-level segmentation in both simulation and real-world settings while reducing perceptual computation overhead. The approach removes the reliance on complex point-cloud processing and large-scale manual annotation, enabling efficient, low-resource manipulation across diverse articulated objects.
📝 Abstract
Visual actionable affordance has emerged as a transformative approach in robotics, focusing on perceiving interaction areas prior to manipulation. Traditional methods rely either on pixel sampling to identify successful interaction samples or on processing point clouds for affordance mapping. However, these approaches are computationally intensive and struggle to adapt to diverse and dynamic environments. This paper introduces ManipGPT, a framework designed to predict optimal interaction areas for articulated objects using a large pre-trained vision transformer (ViT). We created a dataset of 9.9k simulated and real images to bridge the sim-to-real gap and enhance real-world applicability. By fine-tuning the vision transformer on this small dataset, we significantly improved part-level affordance segmentation, adapting the model's in-context segmentation capabilities to robot manipulation scenarios. The resulting part-level affordance masks, paired with an impedance adaptation policy, enable effective manipulation in both simulated and real-world environments, effectively eliminating the need for complex datasets or perception systems.