🤖 AI Summary
Problem: Training dexterous robotic grasping policies is complex, generalizes poorly, and scales badly to large, heterogeneous object sets. Method: We propose UniGraspTransformer, a unified dexterous grasping policy network based on the Transformer architecture. Object-specific policies are first trained with reinforcement learning; their successful grasp trajectories are then distilled into a single universal model via behavior cloning. This simplified policy distillation replaces the multi-stage training pipelines used by prior methods. The model scales to as many as 12 self-attention blocks and is evaluated in both state-based and vision-based settings. Results: In the vision-based setting, UniGraspTransformer improves grasp success rate over UniDexGrasp++ by 3.5%, 7.7%, and 10.1% on seen objects, unseen objects within seen categories, and completely unseen objects, respectively. These results indicate stronger cross-object, cross-category, and zero-shot generalization.
📝 Abstract
We introduce UniGraspTransformer, a universal Transformer-based network for dexterous robotic grasping that simplifies training while enhancing scalability and performance. Unlike prior methods such as UniDexGrasp++, which require complex, multi-step training pipelines, UniGraspTransformer follows a streamlined process: first, dedicated policy networks are trained for individual objects using reinforcement learning to generate successful grasp trajectories; then, these trajectories are distilled into a single, universal network. Our approach enables UniGraspTransformer to scale effectively, incorporating up to 12 self-attention blocks for handling thousands of objects with diverse poses. Additionally, it generalizes well to both idealized and real-world inputs, as evaluated in state-based and vision-based settings. Notably, UniGraspTransformer generates a broader range of grasping poses for objects of various shapes and orientations, resulting in more diverse grasp strategies. Experimental results demonstrate significant improvements over the state-of-the-art method, UniDexGrasp++, across various object categories, achieving success rate gains of 3.5%, 7.7%, and 10.1% on seen objects, unseen objects within seen categories, and completely unseen objects, respectively, in the vision-based setting. Project page: https://dexhand.github.io/UniGraspTransformer.
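The two-stage recipe described above (per-object teacher policies, then distillation of their successful trajectories into one universal policy) can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration and not the authors' implementation: the state is a single number, each "object" is represented by a scalar target, the teachers are hand-written rules standing in for RL-trained policies, and the student is a three-parameter linear model standing in for the Transformer. The names `teacher_action`, `student_action`, and `rollout` are hypothetical.

```python
import random

random.seed(0)

def teacher_action(state, target):
    # Toy stand-in for an RL-trained, object-specific policy (assumption):
    # move halfway toward this object's target at every step.
    return 0.5 * (target - state)

def rollout(policy, target, start, steps):
    """Run a policy from `start`; return the (state, target, action) list and final state."""
    state, trajectory = start, []
    for _ in range(steps):
        action = policy(state, target)
        trajectory.append((state, target, action))
        state += action
    return trajectory, state

# Stage 1: roll out each per-object teacher and keep only successful trajectories.
object_targets = [-1.0, -0.5, 0.0, 0.5, 1.0]
dataset = []
for target in object_targets:
    for _ in range(20):
        trajectory, final = rollout(teacher_action, target, random.uniform(-1, 1), steps=3)
        if abs(final - target) < 0.2:  # toy success criterion
            dataset.extend(trajectory)

# Stage 2: distill all teachers into one student by behavior cloning,
# i.e. supervised regression of teacher actions with a squared-error loss.
weights = [0.0, 0.0, 0.0]  # coefficients for state, target, and bias

def student_action(state, target):
    return weights[0] * state + weights[1] * target + weights[2]

learning_rate = 0.05
for _ in range(50):
    random.shuffle(dataset)
    for state, target, action in dataset:
        error = student_action(state, target) - action
        weights[0] -= learning_rate * error * state
        weights[1] -= learning_rate * error * target
        weights[2] -= learning_rate * error

# Evaluate the single student on a target that no teacher was trained for.
unseen_target = 0.3
_, final_state = rollout(student_action, unseen_target, -0.8, steps=10)
unseen_error = abs(final_state - unseen_target)
print("student weights:", [round(w, 3) for w in weights])
print("error on unseen target:", unseen_error)
```

The point of the sketch is the data flow only: the student never interacts with a reward signal, it only imitates filtered teacher trajectories, which is what lets a single supervised stage replace a multi-stage pipeline. In the toy setting the student generalizes to the unseen target because it is conditioned on the object descriptor, loosely mirroring how a universal policy must condition on object features.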