🤖 AI Summary
This work addresses the challenge of jointly modeling functionality and human-like priors in dexterous robotic grasping. We propose a two-stage, vision-driven grasping framework. Methodologically, it is the first to integrate Negative Affordance-aware Segmentation (NAA) with human-inspired motion priors, enabling teacher–student policy transfer via trajectory imitation pretraining, a residual adaptation module, and privileged information distillation. Our key contributions are: (i) explicit modeling of functionally inappropriate contact regions using NAA, coupled with motion priors to guide functional and anthropomorphically plausible contact point and pose selection; and (ii) distillation of multimodal privileged information, such as geometric and functional annotations, into the visual policy, substantially improving generalization across seen objects, unseen instances, and novel object categories. Experiments demonstrate state-of-the-art performance in grasp success rate, anthropomorphic pose fidelity, and functional plausibility.
📝 Abstract
A dexterous hand capable of grasping objects in a generalizable manner is fundamental to the development of general-purpose embodied AI. However, previous methods focus narrowly on low-level grasp stability metrics, neglecting affordance-aware positioning and human-like poses, which are crucial for downstream manipulation. To address these limitations, we propose AffordDex, a novel framework with two-stage training that learns a universal grasping policy with an inherent understanding of both motion priors and object affordances. In the first stage, a trajectory imitator is pre-trained on a large corpus of human hand motions to instill a strong prior for natural movement. In the second stage, a residual module is trained to adapt these general human-like motions to specific object instances. This refinement is critically guided by two components: our Negative Affordance-aware Segmentation (NAA) module, which identifies functionally inappropriate contact regions, and a privileged teacher-student distillation process that ensures the final vision-based policy is highly successful. Extensive experiments demonstrate that AffordDex not only achieves universal dexterous grasping but also remains remarkably human-like in posture and functionally appropriate in contact location. As a result, AffordDex significantly outperforms state-of-the-art baselines across seen objects, unseen instances, and even entirely novel categories.
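
The abstract does not specify how the residual module combines with the pre-trained imitator or which loss the distillation uses. As a rough illustration only, the sketch below assumes an additive, clipped residual on top of the imitator's action and a mean-squared-error imitation loss between student and teacher actions. The names `compose_residual_action` and `distillation_loss` and the parameters `scale` and `limit` are hypothetical and do not come from the paper.

```python
from typing import List

Vector = List[float]


def compose_residual_action(
    base_action: Vector,
    residual: Vector,
    scale: float = 0.1,
    limit: float = 1.0,
) -> Vector:
    """Add a scaled residual correction to a base action, then clip.

    Assumption: the stage-two module outputs a small correction that is
    added to the stage-one imitator's action. The paper may combine the
    two differently.
    """
    if len(base_action) != len(residual):
        raise ValueError("base_action and residual must have the same length")
    return [
        max(-limit, min(limit, b + scale * r))
        for b, r in zip(base_action, residual)
    ]


def distillation_loss(student_action: Vector, teacher_action: Vector) -> float:
    """Mean squared error between student and teacher actions.

    Assumption: the vision-based student is trained to match the actions of
    a teacher that sees privileged information. MSE is one common choice,
    not necessarily the one used by the authors.
    """
    if len(student_action) != len(teacher_action):
        raise ValueError("action vectors must have the same length")
    squared = [(s - t) ** 2 for s, t in zip(student_action, teacher_action)]
    return sum(squared) / len(squared)
```

Under these assumptions, keeping `scale` small means the adapted motion stays close to the human-like prior while still allowing object-specific corrections.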