🤖 AI Summary
This work addresses the poor generalization of imitation learning for robotic manipulation across geometrically diverse objects, which stems from limited data diversity. The authors propose an approach that integrates 3D generative models with vision foundation models. By establishing semantic keypoint correspondences across large-scale 3D meshes, the method automatically synthesizes diverse, affordance-aware manipulation trajectories for training closed-loop visuomotor policies. This is the first framework to combine large-scale 3D affordance correspondence with generative modeling, enabling zero-shot generalization of manipulation skills across objects. The approach achieves high success rates in both simulation and real-world environments and improves data efficiency and generalization.
📝 Abstract
Despite the recent success of modern imitation learning methods in robot manipulation, their performance often degrades under geometric variation across objects because training data lack diversity. Leveraging powerful 3D generative models and vision foundation models (VFMs), the proposed AffordGen framework overcomes this limitation by using the semantic correspondence of meaningful keypoints across large-scale 3D meshes to generate new robot manipulation trajectories. The resulting large-scale, affordance-aware dataset is then used to train a robust, closed-loop visuomotor policy, combining the semantic generalizability of affordances with the reactive robustness of end-to-end learning. Experiments in simulation and the real world show that policies trained with AffordGen achieve high success rates and generalize zero-shot to unseen objects, significantly improving data efficiency in robot learning.
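
The keypoint-correspondence idea in the abstract can be illustrated with a minimal sketch. This is not the AffordGen implementation: it assumes each mesh vertex already has a feature vector (hypothetically obtained from a VFM), matches a source keypoint to a target mesh by cosine similarity, and retargets a demonstration by a pure translation, which is a deliberate simplification. The function names `transfer_keypoint` and `retarget_waypoints` are made up for this example.

```python
import numpy as np


def transfer_keypoint(src_feats, tgt_feats, src_idx):
    """Return the index of the target vertex whose feature is most
    cosine-similar to the feature of source vertex `src_idx`.

    src_feats: (N, D) per-vertex features of the source mesh (assumed given).
    tgt_feats: (M, D) per-vertex features of the target mesh (assumed given).
    """
    src_feats = np.asarray(src_feats, dtype=float)
    tgt_feats = np.asarray(tgt_feats, dtype=float)
    query = src_feats[src_idx]
    query = query / np.linalg.norm(query)
    tgt_unit = tgt_feats / np.linalg.norm(tgt_feats, axis=1, keepdims=True)
    # Dot product of unit vectors is the cosine similarity.
    return int(np.argmax(tgt_unit @ query))


def retarget_waypoints(waypoints, src_keypoint, tgt_keypoint):
    """Shift end-effector waypoints so they keep the same offset to the
    keypoint on the new object (translation only; an assumption made here
    for brevity, rotation and scale are ignored).

    waypoints: (T, 3) positions recorded relative to the source object.
    """
    waypoints = np.asarray(waypoints, dtype=float)
    offset = np.asarray(tgt_keypoint, dtype=float) - np.asarray(src_keypoint, dtype=float)
    return waypoints + offset


if __name__ == "__main__":
    # Toy data: 2 source vertices, 3 target vertices, 2-D features.
    src_feats = [[1.0, 0.0], [0.0, 1.0]]
    tgt_feats = [[0.0, 2.0], [3.0, 0.1], [1.0, 1.0]]
    tgt_verts = np.array([[0.0, 0.0, 0.0], [0.1, 0.2, 0.0], [0.5, 0.5, 0.5]])

    match = transfer_keypoint(src_feats, tgt_feats, src_idx=0)
    demo = [[0.0, 0.0, 0.1], [0.0, 0.0, 0.0]]  # approach, then contact
    new_traj = retarget_waypoints(demo, [0.0, 0.0, 0.0], tgt_verts[match])
    print(match, new_traj)
```

Repeating this transfer over many generated meshes would yield many trajectories from a single demonstration, which is the data-scaling effect the abstract describes; a policy trained on such data would still need the closed-loop visuomotor training the authors mention.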