UniGraspTransformer: Simplified Policy Distillation for Scalable Dexterous Robotic Grasping

📅 2024-12-03
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Problem: Training dexterous robotic grasping systems is complex, and the resulting policies generalize poorly and struggle to scale to large, heterogeneous object sets. Method: We propose UniGraspTransformer, a unified Transformer-based dexterous grasping policy network. It first pretrains object-specific policies via reinforcement learning, then distills their successful trajectories into a single large unified model via behavior cloning, a simplified policy-distillation paradigm that eliminates the conventional multi-stage training pipeline. The model supports up to 12 self-attention layers and handles both state-based and vision-based inputs. Results: In the vision-based setting, UniGraspTransformer outperforms UniDexGrasp++ with absolute grasp-success-rate gains of 3.5%, 7.7%, and 10.1% on seen objects, unseen objects within seen categories, and entirely unseen objects, respectively, demonstrating strong cross-object, cross-category, and zero-shot generalization.
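As a rough illustration of the behavior-cloning step described above, here is a minimal PyTorch sketch; the `distill` function, the loss choice (plain L2 imitation), and the tensor shapes are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn

def distill(policy: nn.Module, trajectories, epochs: int = 10, lr: float = 1e-4):
    """Behavior cloning: fit the unified student policy to actions recorded
    from the per-object RL teachers. `trajectories` is assumed to be an
    iterable of (obs, actions) tensor pairs collected from teacher rollouts."""
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        for obs, actions in trajectories:
            pred = policy(obs)                            # student's predicted actions
            loss = nn.functional.mse_loss(pred, actions)  # simple L2 imitation loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return policy
```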

📝 Abstract
We introduce UniGraspTransformer, a universal Transformer-based network for dexterous robotic grasping that simplifies training while enhancing scalability and performance. Unlike prior methods such as UniDexGrasp++, which require complex, multi-step training pipelines, UniGraspTransformer follows a streamlined process: first, dedicated policy networks are trained for individual objects using reinforcement learning to generate successful grasp trajectories; then, these trajectories are distilled into a single, universal network. Our approach enables UniGraspTransformer to scale effectively, incorporating up to 12 self-attention blocks for handling thousands of objects with diverse poses. Additionally, it generalizes well to both idealized and real-world inputs, evaluated in state-based and vision-based settings. Notably, UniGraspTransformer generates a broader range of grasping poses for objects in various shapes and orientations, resulting in more diverse grasp strategies. Experimental results demonstrate significant improvements over state-of-the-art, UniDexGrasp++, across various object categories, achieving success rate gains of 3.5%, 7.7%, and 10.1% on seen objects, unseen objects within seen categories, and completely unseen objects, respectively, in the vision-based setting. Project page: https://dexhand.github.io/UniGraspTransformer.
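To make the scaling claim concrete (up to 12 self-attention blocks), below is a minimal PyTorch encoder of that depth; the token layout, dimensions, pooling, and action head are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class UnifiedGraspPolicy(nn.Module):
    """Illustrative unified policy: a stack of self-attention blocks over
    observation tokens, regressing dexterous-hand joint targets."""

    def __init__(self, obs_dim=300, action_dim=24, d_model=256,
                 n_heads=8, n_layers=12):                 # up to 12 blocks
        super().__init__()
        self.embed = nn.Linear(obs_dim, d_model)          # observation -> token
        block = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=n_layers)
        self.head = nn.Linear(d_model, action_dim)        # token -> joint targets

    def forward(self, obs_tokens):                        # (batch, tokens, obs_dim)
        x = self.encoder(self.embed(obs_tokens))
        return self.head(x.mean(dim=1))                   # pool tokens, predict action

# Usage sketch: policy = UnifiedGraspPolicy(); action = policy(torch.randn(2, 4, 300))
```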
Problem

Research questions and friction points this paper is trying to address.

Prior dexterous grasping methods rely on complex, multi-step training pipelines
Learned policies scale poorly to large sets of heterogeneous objects and poses
Grasp success generalizes poorly to unseen objects and unseen categories
Innovation

Methods, ideas, or system contributions that make the work stand out.

Universal Transformer-based policy network (up to 12 self-attention blocks) for dexterous grasping
Simplified training: per-object RL teachers distilled into one network via policy distillation (see the rollout sketch after this list)
Scales to thousands of objects in diverse poses, in both state-based and vision-based settings
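The rollout-collection sketch referenced above: one way the per-object teacher trajectories might be gathered before distillation. The environment API here (`reset()` returning an observation, `step()` returning a done flag and a success flag) is a hypothetical stand-in for the simulator used in the paper.

```python
import torch

def collect_grasp_trajectories(teachers, envs, episodes_per_object=100):
    """Roll out each object-specific RL teacher in its environment and keep
    only successful grasp trajectories for behavior cloning. The env API
    below (reset() -> obs, step(a) -> (obs, done, success)) is hypothetical."""
    dataset = []
    for policy, env in zip(teachers, envs):            # one teacher per object
        for _ in range(episodes_per_object):
            obs, done, success = env.reset(), False, False
            obs_seq, act_seq = [], []
            while not done:
                with torch.no_grad():
                    action = policy(obs)               # teacher action
                obs_seq.append(obs)
                act_seq.append(action)
                obs, done, success = env.step(action)
            if success:                                # distill only successful grasps
                dataset.append((torch.stack(obs_seq), torch.stack(act_seq)))
    return dataset
```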
👥 Authors
Wenbo Wang
University of Sydney
Fangyun Wei
Microsoft Research
Computer Vision · Deep Learning · Generative Models
Lei Zhou
National University of Singapore
Xi Chen
Microsoft Research Asia
Lin Luo
Microsoft Research Asia
Xiaohan Yi
Microsoft Research Asia
Yizhong Zhang
Microsoft Research Asia
Yaobo Liang
microsoft.com
Embodied AI · Natural Language Processing · AI Agent
Chang Xu
University of Sydney
Yan Lu
Microsoft Research Asia
Jiaolong Yang
Microsoft Research
3D Computer Vision
Baining Guo
Distinguished Scientist, Microsoft Research
Computer Graphics · Graphics · Virtual Reality · Geometric Modeling